-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use SIMD operations in InitBlkUnroll/CopyBlkUnroll and increase unroll limit up to 128 bytes #61030
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue Detailsbenchmarks.run.windows.arm64.checked.mch:
Detail diffs
coreclr_tests.pmi.windows.arm64.checked.mch:
Detail diffs
libraries.crossgen2.windows.arm64.checked.mch:
Detail diffs
libraries.pmi.windows.arm64.checked.mch:
Detail diffs
libraries_tests.pmi.windows.arm64.checked.mch:
Detail diffs
|
Tagging subscribers to this area: @JulieLeeMSFT Issue Detailsbenchmarks.run.windows.arm64.checked.mch:
Detail diffs
coreclr_tests.pmi.windows.arm64.checked.mch:
Detail diffs
libraries.crossgen2.windows.arm64.checked.mch:
Detail diffs
libraries.pmi.windows.arm64.checked.mch:
Detail diffs
libraries_tests.pmi.windows.arm64.checked.mch:
Detail diffs
|
I am also adding micro benchmarks to test performance impact of changes surrounding FWIW, the current change is mostly CQ improvements with little effect on the performance. As you can notice, I am not changing The following are the results of running the new micro benchmarks with the proposed increased
|
94eaaaa
to
3c84b5a
Compare
929dc5d
to
5bc4a6e
Compare
5bc4a6e
to
653c8c0
Compare
653c8c0
to
7063255
Compare
7063255
to
c3566b1
Compare
/azp run runtime-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
…lkUnroll() in src/coreclr/jit/codegenarmarch.cpp
…r/jit/codegenarmarch.cpp
…pyBlockUnrollHelper in src/coreclr/jit/codegenarmarch.cpp
…eg for InitBlock in src/coreclr/jit/lsraarmarch.cpp
…eg for CopyBlock in src/coreclr/jit/lsraarmarch.cpp
…adOrStorePairOffset() helper in src/coreclr/jit/emitarm64.h src/coreclr/jit/emitarm64.cpp
…lr/jit/codegenarmarch.cpp
This pull request has been automatically marked |
…r are contained and can potentially occupy up to 2 int registers
@BruceForstall Please take another look. I think I addressed your feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/azp run runtime-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice solution, it looks like it was a lot of fun.
static unsigned GetRegSizeAtLeastBytes(unsigned byteCount) | ||
{ | ||
assert(byteCount != 0); | ||
assert(byteCount < 16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would not be assert(byteCount <= 16);
less confusing?
|
||
void StorePairRegs(int offset, unsigned regSizeBytes) | ||
{ | ||
canEncodeAllStores = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit canEncodeAllStores &|
? It is safe for bool-s in C++.
bool canEncodeAllStores; | ||
}; | ||
|
||
class ProducingStreamBaseInstrs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does base mean no-simd?
|
||
private: | ||
template <class InstructionStream> | ||
void UnrollInitBlock(InstructionStream& instrStream, int initialRegSizeBytes) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code convention forbids non-const ref args https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/clr-jit-coding-conventions.md#1535-reference-arguments
intReg1 = node->ExtractTempReg(RBM_ALLINT); | ||
intReg2 = node->GetSingleTempReg(RBM_ALLINT); | ||
break; | ||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are this default and the next one for simdRegCount
unreached()
?
{ | ||
public: | ||
ProducingStreamBaseInstrs(regNumber intReg1, regNumber intReg2, regNumber addrReg, emitter* emitter) | ||
: intReg1(intReg1), intReg2(intReg2), addrReg(addrReg), emitter(emitter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should all reg be not REG_NA or does it allow some to be REG_NA?
Currently,
CodeGen::genCodeForInitBlkUnroll()
andCodeGen::genCodeForCpBlkUnroll()
do not use SIMD instructions when generating code forInitBlock
/CopyBlock
operations on Arm and Arm64.This PR adds support for that on Arm64.
In order to accurately estimate whether the use of SIMD instructions is needed and/or whether the base address of a block needs to computed and stored to a integer register the algorithm does multiple passes - (one that verifies whether all offsets can be encoded and another that counts how many instructions will be emitted).
The following are justifications why such approach was chosen:
In fact, that issue was already pointed out by a TODO comment (that is removed by the PR)
In other words it transforms a problem of code-generating of
InitBlock(largeOffset, byteCount)
toInitBlock(0, byteCount)
. In order to do figure out that such transformation is required, the algorithm needs a pass to verify whether all the offsets are encodable.For example, the following snippet shows such case
InitBlockAllZeros(startOffset=8, byteCount=16)
) since the JIT would need to initialize a SIMD register while it could just use astr xzr, xzr, [dstReg, #dstOffset]
instead (assumingdstOffset
is encodable). In order to figure out such cases I've chose to use the same mechanism of multiple passes instead of trying to "encode" all corner cases.As an example, the following demonstrates a case where it is beneficial to "spill" that the base of address of destination in
CopyBlock
and use SIMD instructions than use integer instructions without spilling.In addition to that, the instruction selection strategy that the current implementation uses can be characterized as greedy "best fit" and it doesn't always produce the optimal result (for example,
InitBlock
sequence for 7 bytes isstr rInitValue, [rAddr]; strh rInitValue, [rAddr+4]; strb rInitValue, [rAddr+6]
). I changed the strategy slightly and added support to allow overlapping stores (so that sequence would becomestr rInitValue, [rAddr]; stur rInitValue, [rAddr+3]
.The following is an example where using overlapping loads/stores result in shorter instruction sequence