Improve/cleanup RyuJIT's IR handling of operand lists #11058
At a minimum, we should treat such lists consistently - if call's list nodes aren't visited by [...] And once you stop traversing list nodes the next question comes: do they need to be [...]
Actually it is not.
Tried to change PHI to use its own linked list nodes instead of [...] Same for [...] It's very likely that calls also do not need to use [...] HW intrinsics are a bit of a problem. Since their [...]
Unlike [...] The only question is what to do about [...]
I think we are in a slightly better position here now. Previously, we had the [...] So I think (at least for the [...] I don't think we can work around having 3 operands (since things like [...]
Yes, and 4 is already more than I can fit in a small node.
Hmm? But the constant problem is already solved in dotnet/coreclr#14393. We just need to extend that to 32 byte constants, that requires VM support to allocate 32 byte aligned data sections. Shouldn't be a big problem I guess. And perhaps we need actual constant SIMD nodes in the IR, though the cases where that would be useful are pretty rare today.
Yes, but even if we get rid of gather there's also [...]
We might be able to implement that in software, similarly to how we are doing [...]
I think we could probably get by, at least initially, with a 16-byte aligned data section and just using [...]
Maybe I'm missing something but what would that buy us? The software implementation always generates poor code when the arguments are all constants. In fact, they don't even have to be all constants, just enough of them for the trade off between the cost of a memory load and the cost of assembling the vector from elements to be right.
Yes, I didn't sit down to count the cycles yet but it's likely that the cost of an unaligned memory load and the cost of assembling the vector from elements is similar.
I was basically suggesting that, if we were just checking for constants in the importer, then we could avoid needing to create a node with 4 or more operands. We could then also deal with that for the non-constant case by importing the correct chain of intrinsics rather than having a specialized [...] But I see that dotnet/coreclr#14393 was doing it in lowering, so we have to have a node that can carry all the operands to support that.
Yeah, I'm pretty sure the same approach could also be used during import. But if effort is put into handling this case then why not do it in the right place, in lowering, where you have a chance to catch more cases due to the optimizer producing new constants? Having to handle more than 3 operands is a bit of an annoyance but it's not that bad. AFAIR the main issue is that I need to store the number of operands in the node itself, because it cannot be determined solely from the intrinsic id. That is another byte (or couple of bits) I need to try to squeeze between the rest of the data, to avoid making the node larger. Of course, I could just use a linked list, like I'm using for [...]
Looking back at my comment, I think that my comment about [...]
I would like to finish this work, by deleting [...]
I propose the following design:

struct GenTreeMultiOp : GenTree
{
    uint8_t m_operandCount; // We have 3 spare bytes in the base GenTree, we'll use 1 to store the count.
}

struct GenTreeMultiOp : GenTree
{
    GenTree** m_operands;
}

struct GenTreeJitIntrinsic : GenTreeMultiOp
{
    GenTree* inlineOperands[2];
    // The rest of the data members fit into 8 bytes, making this a small node.
}

The [...] I have also made the amount of inline operands configurable (that is why they live in a derived class and not in [...]

You can see the fully functional set of changes that will be required to implement this design here. It comes with a ~0.25% reduction in the number of retired instructions when replaying the x64 benchmarks collection, which is a nice bonus.

cc @dotnet/jit-contrib, @tannergooding
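To make the proposed layout easier to picture, here is a minimal standalone sketch of a node that keeps up to two operands inline and switches to an external array for larger operand lists. It is not the JIT's code: `Node`, `MultiOpNode`, `IntrinsicNode`, `SetOperands`, `UsesInlineStorage` and `SumOperandIds` are hypothetical names invented for illustration, and the external array is simply passed in by the caller where a real compiler would presumably use its own allocator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the base IR node (not the real GenTree).
struct Node
{
    int     id           = 0;
    uint8_t operandCount = 0; // the count lives in the base, mirroring the "spare byte" idea
};

// Operands are reached through a pointer plus the count stored in the base.
// Where the pointed-to array lives is decided by the derived class.
struct MultiOpNode : Node
{
    Node** operands = nullptr;

    Node** begin() { return operands; }
    Node** end() { return operands + operandCount; }
};

// Derived node with room for two operands inline (assumption: two, as in the proposal).
struct IntrinsicNode : MultiOpNode
{
    static constexpr uint8_t InlineCapacity = 2;
    Node* inlineOperands[InlineCapacity] = {};

    IntrinsicNode() = default;
    // 'operands' may point into this object, so copying is disabled in this sketch.
    IntrinsicNode(const IntrinsicNode&) = delete;
    IntrinsicNode& operator=(const IntrinsicNode&) = delete;

    // 'storage' must provide at least 'count' slots when count > InlineCapacity.
    void SetOperands(Node* const* ops, uint8_t count, Node** storage)
    {
        operands = (count <= InlineCapacity) ? inlineOperands : storage;
        for (uint8_t i = 0; i < count; i++)
        {
            operands[i] = ops[i];
        }
        operandCount = count;
    }

    bool UsesInlineStorage() const { return operands == inlineOperands; }
};

// A consumer that only sees the range interface, not the storage choice.
int SumOperandIds(MultiOpNode& node)
{
    int sum = 0;
    for (Node* op : node)
    {
        sum += op->id;
    }
    return sum;
}
```

In this sketch the consumer code is identical for two operands and for three or more; only `SetOperands` knows which storage is in use.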
My own view here (as someone who semi-regularly contributes but doesn't work on the JIT team) is that while this might be a general improvement, it's not clear that it's the right first step and doesn't look to take [...] In particular, I think part of the problem is that there are multiple ways to access and enumerate the operand nodes. That is, you can access the fields directly as [...] Likewise, while we have [...] So I think that the right "first step" is to clean this dependence on "internals" up. That is, it shouldn't matter how we internally represent 0/1/2/3+ operands, this should be completely hidden to the general consumer. We should: [...]
Once we have this and everything is going through these APIs for accessing and dealing with operands, then we no longer have any concern about how it's implemented internally. We are free to have inline operands or linked lists or dynamically allocated arrays and to freely swap it around without worrying about [...]
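As an illustration of the "hide the representation behind one access API" idea, here is a small standalone sketch. It assumes a toy IR in which unary, binary and multi-operand nodes store their operands differently; `Kind`, `Node`, `ForEachOperand` and `SumTree` are hypothetical names, not JIT APIs. Only `ForEachOperand` knows the storage details, and the consumer is written purely against it.

```cpp
#include <cassert>
#include <vector>

// Toy node kinds; each one stores its operands differently.
enum class Kind { Leaf, Unary, Binary, Multi };

struct Node
{
    Kind               kind;
    int                value;
    Node*              op1 = nullptr; // used by Unary and Binary
    Node*              op2 = nullptr; // used by Binary
    std::vector<Node*> extra;         // used by Multi

    Node(Kind k, int v) : kind(k), value(v) {}
};

// The only function that knows how each kind lays out its operands.
template <typename Fn>
void ForEachOperand(const Node& node, Fn&& fn)
{
    switch (node.kind)
    {
        case Kind::Leaf:
            break;
        case Kind::Unary:
            fn(*node.op1);
            break;
        case Kind::Binary:
            fn(*node.op1);
            fn(*node.op2);
            break;
        case Kind::Multi:
            for (Node* op : node.extra)
            {
                fn(*op);
            }
            break;
    }
}

// A consumer that never touches op1/op2/extra directly.
int SumTree(const Node& node)
{
    int sum = node.value;
    ForEachOperand(node, [&](const Node& op) { sum += SumTree(op); });
    return sum;
}
```

With this shape, changing how `Multi` nodes store operands (inline array, linked list, allocated array) would only touch `ForEachOperand`. Whether such an abstraction is free in terms of throughput is exactly what the thread debates, and this sketch does not settle it.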
I have been thinking about this for the past few days, and here's what I came to. First, what we cannot change: [...]
Given the above, I simply do not see a way to "fully" unify the operands iteration without TP impact (trivially, getting the [...] Aside from that, [...]
I note that I explicitly did not take calls into account because I do not think it is fruitful at this time to try and make them use arrays (for the reasons mentioned above). Calls have very different requirements to "primitive" operations like the intrinsics represent (which is to say they need "fat" [...]
I think this is worth validating. Ideally for all our scenarios the C++ methods are appropriately marked [...] That is, having a [...] Even [...]
Yes, but today these details on how it's stored leak at every level and that's the main problem that I've seen. Because everything is exposed differently based on the node kind, you have to remember anytime you deal with operands that [...] My main concern here is that [...]
Well, one of the problems is that "opers" represent a hierarchy that the node types are disjoint from. That is, if we are passing a [...] And of course, I note that the proposed changes simplify the intrinsic nodes in this respect: there will only be one way to access their operands (modulo iterators which exist for convenience), using [...]
What are the examples of this w.r.t. the operands representation? From what I've seen so far it is "ok", modulo some very special cases that should really just be deleted for good (e.g. some SIMD init with a zero drops the zero, making it implicit), or are fixable without changing the operand layout itself (e.g. how [...]
But why? In other words, what things become more difficult with the proposed changes? The way I see it, it just makes things more uniform and streamlined, and, while not overhauling the Jit, is a step in the right direction (eliminating a class of IR nodes that are not actual nodes - the last one is [...]
With #59912 merged, this work has now been completed 🎉! Remaining simplifications enabled by this change are tracked in source via [...]
Some IR nodes require a variable number of operands and the current implementation uses linked lists of GT_LIST (or similar) nodes.

IR nodes known to use such lists:
- GT_CALL - has a dedicated "this" argument operand and a list of operands for all other arguments
- GT_PHI - this is a GTK_UNOP node and its sole operand is a list
- GT_HWINTRINSIC - this is a GTK_BINOP node but if it needs 3 operands or more the first operand is a list and the second is null
- GT_FIELD_LIST - a GTK_BINOP that acts both like an actual IR node and like a list node

From the above only GT_CALL is reasonable, though it could probably be improved as well. The rest suffer from various issues:
- GT_PHI - this looks similar to GT_CALL on the surface but it lacks the special handling of calls throughout the JIT. Its list nodes end up in the linear order and they are visited by tree walking facilities like GenTreeVisitor. This is nonsense - these list nodes are an internal implementation detail of the GT_PHI node and they should not appear anywhere. They have no type, no VN, no nothing. They simply do not exist as far as the IR is concerned.
- GT_HWINTRINSIC suffers from the same problem as GT_PHI. It just makes it worse by sometimes using lists and sometimes not. And unlike GT_PHI, this lacks special handling in gtDispTree and dumps its list nodes.
- GT_FIELD_LIST - IMO it would be better to have a GT_PACK node with a list of GT_FIELD_LIST nodes.

- GT_LIST in GT_PHI ([Feedback] Can not update to .NETCore 1.1.0 after upgrading to Microsoft.NETCore.UniversalWindowsPlatform v5.3.0 #20266)
- GT_LIST in GT_CALL (dotnet-runtime-2.0.0 installation fails in centos 7.3 #26392)
- GT_LIST in GT_FIELD_LIST (Calling 3rd party DLL that makes calls out to the internet from .Net core fails proxy authentication despite configuring or coding the required settings #26800)
- GT_LIST in GT_SIMD ([WIP] Stop using LIST nodes for SIMD operand lists #1141)
- GT_LIST in GT_HWINTRINSIC
- GT_LIST code

category:implementation
theme:ir
skill-level:expert
cost:extra-large
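The GT_PHI complaint in the issue text (list nodes showing up in tree walks even though they are only an implementation detail) can be sketched with a toy IR. This is an illustration under stated assumptions, not JIT code: `Oper`, `Node`, `Walk`, `WalkSkippingLists` and `CountOper` are hypothetical names, and the node layout is a simplified guess at the "operand is a chain of list nodes" shape described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Oper { Local, List, Phi };

struct Node
{
    Oper  oper;
    Node* op1 = nullptr; // List: the element; Phi: head of the list chain
    Node* op2 = nullptr; // List: the next list node

    explicit Node(Oper o) : oper(o) {}
};

// Naive post-order walk: everything reachable through op1/op2 is reported,
// so the list nodes are visited as if they were real IR nodes.
void Walk(Node* node, std::vector<Oper>& visited)
{
    if (node == nullptr)
    {
        return;
    }
    Walk(node->op1, visited);
    Walk(node->op2, visited);
    visited.push_back(node->oper);
}

// Walk that treats list nodes as an internal detail of their owner:
// it follows them to reach the elements but never reports them.
void WalkSkippingLists(Node* node, std::vector<Oper>& visited)
{
    if (node == nullptr)
    {
        return;
    }
    WalkSkippingLists(node->op1, visited);
    WalkSkippingLists(node->op2, visited);
    if (node->oper != Oper::List)
    {
        visited.push_back(node->oper);
    }
}

// Counts how many visited nodes have the given oper.
size_t CountOper(const std::vector<Oper>& visited, Oper oper)
{
    size_t count = 0;
    for (Oper o : visited)
    {
        if (o == oper)
        {
            count++;
        }
    }
    return count;
}
```

For a phi with three arguments, the naive walk reports three extra list nodes that carry no type or value of their own. Skipping them in every walker is the kind of special handling the issue says calls have and GT_PHI lacks; storing the operands without list nodes removes the need for it.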