-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Arm64 encodings for IF_SVE_AA_3A to IF_SVE_HL_3A #95127
Conversation
Change-Id: I9d32d52c8a54e4f001a8af5ff556e747087c9149
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsAdd all the groups IF_SVE_**_3A. Based on top of #95105
|
IF_SVE_AA_3A, IF_SVE_AB_3A, IF_SVE_AC_3A are done and fully working. Output in jitdump:
Output in gdb:
|
Picked up the change in dotnet#95127 (comment)
@@ -505,6 +505,9 @@ class emitter | |||
OPSZ32 = 5, | |||
OPSZ64 = 6, | |||
OPSZ_COUNT = 7, | |||
#elif defined(TARGET_ARM64) | |||
OPSZ_SCALABLE = 5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we are using it anywhere. can we revert this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used indirectly. When setting idOpSize, the emitattr is converted into opSize:
void idOpSize(emitAttr opsz)
{
_idOpSize = emitEncodeSize(opsz);
}
/* static */ inline emitter::opSize emitter::emitEncodeSize(emitAttr size)
{
assert((size != EA_UNKNOWN) && ((size & EA_SIZE_MASK) == size));
return static_cast<emitter::opSize>(genLog2(size));
}
/* static */ inline emitAttr emitter::emitDecodeSize(emitter::opSize ensz)
{
assert(static_cast<unsigned>(ensz) < OPSZ_COUNT);
return emitSizeDecode[ensz];
}
Without the changes, a value of EA_SCALABLE
becomes OPSZ_COUNT
, and then asserts inside emitDecodeSize()
src/coreclr/jit/emitarm64.cpp
Outdated
case IF_SVE_HJ_3A: // ........xx...... ...gggmmmmmddddd -- SVE floating-point serial reduction (predicated) | ||
case IF_SVE_HL_3A: // ........xx...... ...gggmmmmmddddd -- SVE floating-point arithmetic (predicated) | ||
emitDispSveReg(id->idReg1(), id->idInsOpt(), true); // ddddd | ||
emitDispPredicateReg(id->idReg2(), true, true); // ggg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am actually wondering if the merge == true
is correct for all the formats it shares?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is one example I found, but there are just handful of them, so perhaps for them, we can split the case.
case IF_SVE_AR_4A: // ........xx.mmmmm ...gggnnnnnddddd -- SVE integer multiply-accumulate writing addend (predicated)
case IF_SVE_GI_4A: // ........xx.mmmmm ...gggnnnnnddddd -- SVE2 histogram generation (vector)
case IF_SVE_HU_4A: // ........xx.mmmmm ...gggnnnnnddddd -- SVE floating-point multiply-accumulate writing addend
emitDispSveReg(id->idReg10(), id->idInsOpt(), true); // ddddd
emitDispSveReg(id->idReg20(), id->idInsOpt(), true); // nnnnn
emitDispPredicateReg(id->idReg30(), id->idInsOpt(), true); // ggg
emitDispSveReg(id->idReg40(), id->idInsOpt(), true); // mmmmm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should take it group by group. Hardcoding with different cases should be ok for most things. But, will need a little more thinking when we get to the instructions that can take /M or /Z
Sure. I have put it in #95105 because that will get merge sooner, unblocking others. |
* Add insEncodeReg* methods * Adjust the `ins` for sve instruction Picked up the change in #95127 (comment) * Update the comments * Use sve_ins_offset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
theEmitter->emitIns_R_R_R(INS_sve_subr, EA_SCALABLE, REG_V2, REG_P0, REG_V13, | ||
INS_OPTS_SCALABLE_S); // SUBR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T> | ||
|
||
theEmitter->emitIns_R_R_R(INS_sve_sdiv, EA_SCALABLE, REG_V3, REG_P2, REG_V9, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tool generated just 1 line per instruction, but we should add some more permutation for registers, emitAttr
and instOpts
to make sure emitIns_*
methods works as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For each small group I've made sure to try to go across as many permutations as possible. (ie: INS_sve_and, INS_sve_bic, INS_sve_eor, INS_sve_orr
are all IF_SVE_AA_3A
and go through the same code so will have the same boundaries). The number of instructions here meant I felt there was decent coverage. I also made sure to hit value boundaries where possible (0, 31 etc).
The only instruction I felt I needed to add more permutations on were the EA_4BYTE
versions of clasta/clastb
.
What I found limiting is we can't test for error cases. Eg: I can't check that I can't generate a B
version of INS_sve_sdiv
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eg: I can't check that I can't generate a B version of INS_sve_sdiv.
Yes, I don't think we have code that verifies that something is not possible to get generated.
@@ -8048,6 +8114,216 @@ void emitter::emitIns_R_R_R( | |||
fmt = IF_DV_3A; | |||
break; | |||
|
|||
case INS_sve_and: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps I should also let the tool generate this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be useful, even if it's just some isVectorRegister() asserts
@a74nh I see one set of discrepancies in your example above: JitDump:
cstool:
(so, the JIT is generating |
// #define ALL_ARM64_EMITTER_UNIT_TESTS_GENERAL | ||
// #define ALL_ARM64_EMITTER_UNIT_TESTS_ADVSIMD | ||
// #define ALL_ARM64_EMITTER_UNIT_TESTS_SVE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding that categories; hopefully that helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I almost put this in another PR.
Also, the INS_brk
test fails to compile because it should be INS_brk_windows
and INS_brk_linux
.
It be good if we could use the cstool output to automatically have this tested and checked.
Done. This is purely a printing issue, the encoding of the instruction is correct. For every instruction I've done so far, there are no bits to indicate the type ( |
This PR has 82 instructions out of the 889 generated by the tool. That's 9.2%. It took me a week to write the code. |
That's almost 1 hour per instruction. I was hoping it would be faster given the tool generated most of the code, but I guess most of the time taken was also writing some new methods that were needed and stitching everything together. Hoping next batch would be faster and if not, let me know if things need to be added in the tool to reduce the burden. Here are few things that I have:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes looks great to me. Added some minor comments.
src/coreclr/jit/emitarm64.cpp
Outdated
case IF_SVE_GR_3A: // ........xx...... ...gggmmmmmddddd -- SVE2 floating-point pairwise operations | ||
case IF_SVE_HJ_3A: // ........xx...... ...gggmmmmmddddd -- SVE floating-point serial reduction (predicated) | ||
case IF_SVE_HL_3A: // ........xx...... ...gggmmmmmddddd -- SVE floating-point arithmetic (predicated) | ||
result.insThroughput = PERFSCORE_THROUGHPUT_1C; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering if this should be fixed to the real values of latency and throughput. can you point to a document that has these details?
I didn't find anything on internet except this kind of questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tricky one, as the latency and throughput will change depending on the microarchitecture. For example, neon instructions won't be the same on Altra and N2.
For now, maybe we should just pick one (N2 sounds a reasonable choice for SVE).
There is definitely a future task to add other microarchitecture latencies and throughputs into coreclr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, maybe we should just pick one (N2 sounds a reasonable choice for SVE).
If there is a parsable data, I can probably generate it via the tool with right numbers too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, only the pdf.
I've manually done it for this PR. It wasn't fun to do. Thankfully, most of your groups are the same as the groups in the guide. But not all.
Not sure how easy it will be to parse as you'll need to turn instructions back into the groups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, only the pdf.
Can you share the link for pdf that has SVE instructions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N2 optimisation guide:
https://developer.arm.com/documentation/PJDOC-466751330-18256/0003
SVE is from 3.26.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted that information and pasted in #94549 (comment). I will see if I can make the tool parse that and generate the right latencies and throughput.
Change-Id: I263ae1c649a66de393bca0c8f5ac2f9c3311fcbe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixed all of these and uploaded latest files in #94549. |
Add the first set of 17 SVE groups.
Contributes towards #94549
Based on top of #95105
Code starts from 55731b4