-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow multiple kmask registers to be allocated and cleanup some codegen around them #89059
Conversation
#if defined(TARGET_XARCH) | ||
// xarch has mask registers available when EVEX is supported | ||
|
||
if (compiler->canUseEvexEncoding()) | ||
{ | ||
for (unsigned int i = 0; i < lsraRegOrderMskSize; i++) | ||
{ | ||
regNumber reg = lsraRegOrderMsk[i]; | ||
RegRecord* curr = &physRegs[reg]; | ||
curr->regOrder = (unsigned char)i; | ||
} | ||
} | ||
#endif // TARGET_XARCH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much of the TP regression comes from this. LSRA is very sensitive to the additional cost from building these extra 8 registers.
src/coreclr/jit/lsrabuild.cpp
Outdated
#if defined(TARGET_XARCH) | ||
killMask &= ~RBM_MSK_CALLEE_TRASH; | ||
#endif // TARGET_XARCH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps interestingly, having this vs not having it makes a difference to the integer registers allocated.
That, in general, seems odd and one would expect that the kill mask including or excluding floating/mask registers shouldn't impact how integer registers are allocated (particularly when there are no floating-point or mask registers actually in use).
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
@@ -92,7 +92,9 @@ | |||
#define REG_MASK_FIRST REG_K0 | |||
#define REG_MASK_LAST REG_K7 | |||
|
|||
#define RBM_ALLMASK RBM_K1 | |||
#define RBM_ALLMASK_INIT (0) | |||
#define RBM_ALLMASK_EVEX (RBM_K1 | RBM_K2 | RBM_K3 | RBM_K4 | RBM_K5 | RBM_K6 | RBM_K7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notably this doesn't include K0
because it is a bit special.
In particular, while K0
can be used as an input or an output of many of the kmask instructions
, it cannot be used as a predicate to many of the SIMD instructions because it instead represents "no predication".
We should likely ensure K0
can still be used longer term, but for this PR it's unnecessary.
case NI_AVX512F_Add: | ||
case NI_AVX512BW_Add: | ||
case NI_AVX512F_And: | ||
case NI_AVX512DQ_And: | ||
case NI_AVX512F_AndNot: | ||
case NI_AVX512DQ_AndNot: | ||
case NI_AVX512F_Or: | ||
case NI_AVX512DQ_Or: | ||
case NI_AVX512F_Xor: | ||
case NI_AVX512DQ_Xor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This covers the core bitwise patterns around kmask
registers. It notably doesn't factor in knot
, kshift
, or kxnor
but those are somewhat less common and can be handled in the future.
// We want to ensure that we get a TYP_MASK local to | ||
// ensure the relevant optimizations can kick in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ensures we don't typically end up with a local hiding the ConvertMaskToVector
and results in overall better codegen since most cases allow it to be elided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this is not needed for ConvertVectorToMask
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. We just care about seeing the TYP_MASK
node since it needs special handling. TYP_SIMD
nodes are already well handled throughout.
I extracted it to an isolated PR, but its worth noting it is not completely unrelated to the kmask register change. We need to check and handle if the register type is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, the TP regression is coming from creating the register mask registers (as you pointed out), but its impact should be similar on Minopts vs. Regular. Do we know why MinOpts shows higher TP regression? I am also don't understand why replacing the varTypeIsFloating
with varTypeUsesIntReg
would cause it. Did you find out the assembly difference between the two?
(numbers are similar for linux/x64):
Also, do we know why asmdiffs for linux/x64 is way less (858 bytes) than windows/x64 (11K bytes)?
Could you confirm why clang shows TP regression on arm64?
DEF_TP(MASK ,"mask" , TYP_MASK, 8, 8, 8, 2, 8, VTR_MASK, availableMaskRegs, VTF_ANY) | ||
DEF_TP(SIMD32 ,"simd32" , TYP_SIMD32, 32,32, 32, 8,16, VTR_FLOAT, availableDoubleRegs, RBM_FLT_CALLEE_SAVED, RBM_FLT_CALLEE_TRASH, VTF_S|VTF_VEC) | ||
DEF_TP(SIMD64 ,"simd64" , TYP_SIMD64, 64,64, 64, 16,16, VTR_FLOAT, availableDoubleRegs, RBM_FLT_CALLEE_SAVED, RBM_FLT_CALLEE_TRASH, VTF_S|VTF_VEC) | ||
DEF_TP(MASK ,"mask" , TYP_MASK, 8, 8, 8, 2, 8, VTR_MASK, availableMaskRegs, RBM_MSK_CALLEE_SAVED, RBM_MSK_CALLEE_TRASH, VTF_S) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier, was this mistakenly marked as VTF_ANY
instead of VTF_S
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It is strictly a struct and should be involved in struct like copying and other optimizations.
src/coreclr/jit/lowerxarch.cpp
Outdated
@@ -1927,11 +2002,16 @@ GenTree* Lowering::LowerHWIntrinsicCmpOp(GenTreeHWIntrinsic* node, genTreeOps cm | |||
|
|||
default: | |||
{ | |||
unreached(); | |||
maskIntrinsicId = NI_AVX512F_NotMask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate how all the default cases sets NI_AVX512F_NotMask
? Can we add some assert to make sure that we are setting this for correct intrinsicid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's already a couple lengthy comment above this code elaborating on the overall needs and the needs of this path in particular.
For the cases where we have a partial mask, we need to invert the comparison. This is because matches
become 1
and bits are otherwise 0
, including any bits that are unused in the comparison.
Consider for example if we had Vector128<int>
and so we have 4-bits
in the mask, but the mask instruction always checks at least 8-bits
. If we did CompareEqual
and all elements matched, we'd still get 0b0000_1111
and so we couldn't easily check "are all matching". However, if we invert the comparison and do CompareNotEqual
we get 0b0000_0000
for all matches and can then trivially check "are all matching".
This is meant to handle the general case of "other mask intrinsic" where we don't have a well known inverse. Notably I think I forgot a bit of code here, we need to either invert just the lower n
bits or always set the upper 4-bits. Will fix to ensure just that is happening
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
@@ -3968,6 +3998,12 @@ int LinearScan::BuildReturn(GenTree* tree) | |||
{ | |||
buildInternalIntRegisterDefForNode(tree, dstRegMask); | |||
} | |||
#if defined(TARGET_XARCH) && defined(FEATURE_SIMD) | |||
else if (varTypeUsesMaskReg(dstType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check be guarded for AVX512 platforms only to reduce the impact of TP on non-AVX512 xarch platforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
varTypeUsesMaskReg
is basically doing a dstType == TYP_MASK
check (and we could actually optimize it to that given we only have the 1 mask type today).
Checking comp->canUseEvexEncoding
would just add an additional branch and not really save us much (we'd be trading one comparison/branch for another). It may even and potentially slow down the actual execution due to more branches being in a small code window which can pessimize the branch predictor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, that's what I thought, because it is still a runtime check and will cost us. In LSRA, I think this check would have increased the TP regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. It will increase the cost to get a floating-point internal register. But, that cost basically "has" to exist; we're either going to have else if (comp->canUseEvexEncoding && varTypeUsesMaskReg(dstType))
-or- else if (varTypeUsesMaskReg(dstType))
-or- else if (varTypeUsesFloatReg(dstType))
The first one is a memory access
, branch
, comparison
, and branch
The second one is currently just a comparison
and branch
.
The third is currently a memory access
, comparison
, and branch
So what we have currently is the cheapest we can make it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could potentially have a small lookup table that maps type
to a fnptr
representing the buildInternalReg
call. However, that's potentially more expensive in terms of actual execution cost.
I'd think there are better ways to win the TP back instead.
They won't and shouldn't be similar as the number of total instructions executed for You can see the following, for example, where
|
Linux x64 doesn't include
TP is a measurement of instructions executed, not strictly a measurement of perf. Clang prefers to emit more instructions by default, generating code that it believes will take less time to execute and so it will almost always show a worse TP regression than MSVC. In practice the code will take similar amounts of time to execute end to end. There are a great number of scenarios where longer codegen (more instructions) are better than shorter codegen (less instructions), as well as many scenarios where a change may not be on a hot path, may get removed via PGO, may actually execute more code but result in faster execution time, etc. The only way to know for sure if a change like this is good or bad is to actually measure the average execution time for various minopts vs fullopts scenarios. |
} | ||
|
||
// Not all of the callee trash values are constant, so don't declare this as a method local static |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not for this PR, but should we refactor this at other places as well, if similar situation exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other places are currently all constant, from what I saw at least. If we have any other non-constant places introduced in this PR then we should update them
I meant that for arm64, this should have shown improvement regardless of msvc or clang, and wanted to see the diff (if possible) to understand why arm64 has regression only on clang. |
That's making an assumption that Clang will generate the same code as MSVC, when they can frequently differ. In this case it was because the This should be resolved now that I've changed it to be an array that's initialized when the LSRA is created instead. |
a89d145
to
14f269b
Compare
Seems there is a new test failure:
|
Yes, I'm looking into it, but I'm not convinced its related. It also showed up on an previously passing commit and there hasn't been any change to the constant handling for the code bit in question (the only change was from using |
Found the issue. There was an edge case under DPGO where we were getting a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Various pessimizations around the kmask handling were found which were causing suboptimal performance numbers for various algorithms updated to support V512.
To address this, the PR ensures that kmask registers are properly supported in the register allocator and that the core patterns necessary are recognized around them to get efficient codegen.
The asmdiffs, particularly for Windows x64 are very positive.
The TP impact, however, is not great and comes from LSRA needing additional checks in many paths to handle the new register kind. I don't think there is a trivial fix for this as an additional register file simply requires more checks/handling.
The Arm64 TP impact comes from the fix to calleeSaveRegs/callerSaveRegs called out here: https://github.com/dotnet/runtime/pull/89059/files#r1269918053. Namely they were doing an
IsType
rather than aUsesReg
check and were technically returning the wrongCALLEE_SAVED
set forTYP_STRUCT
. Its unclear why this is viewed as more expensive given that the two checks are doing similar checks (IsType
does((varTypeClassification[TypeGet(vt)] & (VTF_INT)) != 0);
, whileUsesReg
doesvarTypeRegister[TypeGet(vt)] == VTR_INT
. This is really just the difference betweentest; jcc
vscmp; jcc