Improve preferencing and code generation for FMA #12984
Comments
@CarolEidt Should this remain a 5.0 issue, or should it be moved to Future?
This probably doesn't take priority over the other work in progress, but I believe it is a fairly straightforward issue to fix, so perhaps it would be a good task for someone wanting to gain expertise in the backend of the JIT. But perhaps change it to Future.
I went ahead and moved it to Future, but we can always choose to bring it back.
Hi, @CarolEidt. I'm interested in this issue and would like to work on it. Could you please help me confirm whether my understanding of the requirement is correct? Thanks! What we want to improve here is to generate the best-fit FMA instruction when the computation result is written into one of the three input operands. To achieve this, we need to add checks and bias containment in lowering, and also do corresponding work in the LSRA and codegen phases, so those are the three places I'm planning to add code.
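(Background sketch for reference, not part of the original comment: how the three VEX FMA encoding forms determine which operand is overwritten. The names and the small helper below are illustrative only.)

```cpp
// Illustration only: the three VEX FMA forms all write their result to the
// first operand; the digits say which operands are multiplied and which is
// added. For operands (x, y, z):
//
//   vfmadd132 x, y, z   =>   x = x * z + y
//   vfmadd213 x, y, z   =>   x = y * x + z
//   vfmadd231 x, y, z   =>   x = y * z + x
//
// So for r = a * b + c, the "best fit" form depends on whose register r can
// clobber: reusing a or b (the multiply commutes) fits 213/132; reusing c fits 231.

enum class FmaForm { Form132, Form213, Form231 };

// whichOperand: 0 or 1 if the destination reuses a multiplicand's register,
// 2 if it reuses the addend's register.
FmaForm chooseFmaForm(int whichOperand)
{
    return (whichOperand == 2) ? FmaForm::Form231 : FmaForm::Form213;
}
```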
@weilinwa - That looks right.
Hi, @kunalspathak. Thanks for the confirmation!
We also have #12212 and #11260 as related issues. IIRC, most of these issues were root-caused to #6358: that is, allowing the register allocator to set multiple operands as "optional" for the cases where the instruction is either commutative or (as with FMA) has multiple encoding forms that differ only in which operand is overwritten. There are several notes in both of the related issues that provide some additional analysis and potential avenues for fixing it.
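(A minimal standalone model of that idea, assuming nothing about the actual LSRA implementation: when several operands are acceptable targets to overwrite, prefer one that is a last use, since clobbering a still-live value forces a preserving copy.)

```cpp
#include <cstddef>

// Simplified model: each source operand either dies at this instruction
// (last use) or stays live afterwards.
struct SourceOperand
{
    bool isLastUse;
};

// Return the index of an operand whose register can be overwritten for free,
// or -1 if every operand is still live and an extra copy (e.g. a vmovaps)
// will be needed to preserve one of them.
int pickOperandToOverwrite(const SourceOperand* ops, size_t count)
{
    for (size_t i = 0; i < count; i++)
    {
        if (ops[i].isLastUse)
        {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```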
There might be a better way to do this, but one method I can think of is at runtime/src/coreclr/jit/lower.cpp, line 6257 (in d4b98b9).
Sorry for the delay in response (I'm now retired). You've got the right general idea. If you've got an expression like …
Nice to see you @CarolEidt! :)
@CarolEidt, thanks for your response!
@tannergooding @kunalspathak, I implemented a solution by comparing the lclNum of the overwritten local with those of the three ops and choosing the FMA form accordingly if there is a match.
The code looks like this. Is this on the right track?
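(A minimal standalone sketch of the matching logic being described, with hypothetical stand-in types rather than the JIT's actual GenTree/LclVar data structures; it is not the snippet originally posted.)

```cpp
#include <cstdio>

// Hypothetical stand-in: a local variable number identifies a value;
// kNoLclNum means the node is not a local at all.
constexpr unsigned kNoLclNum = ~0u;

enum class FmaForm { Form132, Form213, Form231 };

// Given the local that the FMA result is stored back into and the locals used
// as the three operands of a * b + c, pick the form whose overwritten operand
// matches the store target, so no extra register copy is needed.
// Falls back to 213 here when there is no match.
FmaForm chooseFormForStore(unsigned storeLclNum,
                           unsigned op1LclNum,  // a
                           unsigned op2LclNum,  // b
                           unsigned op3LclNum)  // c
{
    if (storeLclNum != kNoLclNum)
    {
        if (storeLclNum == op1LclNum)
        {
            return FmaForm::Form132; // a = a * b + c
        }
        if (storeLclNum == op2LclNum)
        {
            return FmaForm::Form213; // b = a * b + c
        }
        if (storeLclNum == op3LclNum)
        {
            return FmaForm::Form231; // c = a * b + c
        }
    }
    return FmaForm::Form213;
}

int main()
{
    // x = x * y + z : the store target matches operand 1, so form 132 fits.
    FmaForm f = chooseFormForStore(/*store*/ 1, /*a*/ 1, /*b*/ 2, /*c*/ 3);
    printf("form: %d\n", static_cast<int>(f));
    return 0;
}
```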
Here are some of the results I got using the test cases from those related issues:
Original codegen: …
New codegen: …
Original codegen: …
New codegen: …
@weilinwa - Thanks, this looks pretty good.
@kunalspathak @tannergooding, I also have a question about instruction sequences like …
Are these two instructions in sequence redundant? I see this type of instruction in different places. Should we consider optimizing them?
We should understand why these are getting generated. We detect such redundant moves in arm64 using IsRedundantLdStr (runtime/src/coreclr/jit/emitarm64.cpp, lines 15626 to 15648 in cd1b4cf).
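(A standalone sketch of that kind of peephole, not the actual IsRedundantLdStr code: remember the last store that was emitted and drop a load that immediately reloads the same register from the same location.)

```cpp
#include <cstdint>

// Simplified model of a store or load: which register and which stack slot
// (or address) it touches, and how many bytes.
struct MemAccess
{
    int      reg;
    intptr_t address;
    int      size;
};

// State tracked by the emitter model: the most recent store, valid only until
// something that could invalidate it (a label, a call, another write to the
// same slot, ...) is emitted.
struct EmitterState
{
    bool      hasLastStore = false;
    MemAccess lastStore{};
};

// Returns true when the load just reloads what the last store wrote, so the
// value is already in the register and the load can be skipped.
bool isRedundantLoad(const EmitterState& state, const MemAccess& load)
{
    return state.hasLastStore &&
           (state.lastStore.reg == load.reg) &&
           (state.lastStore.address == load.address) &&
           (state.lastStore.size == load.size);
}
```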
I agree. This looks like a good improvement over the codegen we had. Kunal is probably the best person to comment on whether the LSRA changes are correct or will run into any issues.
@kunalspathak, we don't have an equivalent in x86/x64 AFAIK. We only recently added minimal redundant move detection in .NET 6.
I have a commit to extend this to xarch. The commit will essentially mirror what we do for register-to-register movs, but with slight modifications to customize it for mem<->reg movs.
Issue appears fixed; closing |
As pointed out in dotnet/coreclr#25387, there are improvements to be made in the handling of the FMA intrinsics:
Lowering should bias the containment such that the overwritten operand is a last use if possible. dotnet/coreclr#25387 includes a code snippet illustrating the issue.
category:cq
theme:vector-codegen
skill-level:expert
cost:large