`apply_scale_32` not lowering to ARM NEON `sqdmulh` instructions #9109
Comments
(Title previously: `tosa.mul {shift=31}` and `apply_scale_32` are not lowering to ARM NEON `sqdmulh` instructions)
@bjacob This should be fixed for given we have a 32-bit version of the lowering now. llvm/llvm-project@9294a1e
does 'fixed' mean 'actually generate a fixed-point mul instruction like
At this point we generate almost purely If not we would need to have some intrinsic we can insert instead of the minimal
Before going into the backend to see whether it uses the opportunity to generate a In particular, the IR going into the backend must convey the information that at least one of the operands is not The other part as explained in the above PR description is that
Hi @bjacob. In the “meeting the no-overflowing-int32 requirement” section of the issue statement, you mention that a left shift of (31-shift) on the value to be scaled could overflow an int32. I think the third line of the TOSA spec for apply_scale_32 is relevant to this. This REQUIRE statement requires “value” to be in a range for which a left shift of up to (32-shift) does not overflow (for shift<32).
Thanks @dominicsymes, I had missed this! So this leaves only the other issue, of avoiding the
Preamble
ARM NEON has fixed-point multiplication instructions, like `sqdmulh`. Existing NN inference solutions (TFLite, ruy, XNNPACK) use them, and it is not possible to match their performance in quantized workloads without using them. These instructions are important primarily¹ in the quantization rescaling of internal `int32` accumulators at the end of each quantized layer.
ARM has already successfully invested effort in specifying rescalings in a way that allows bit-exact agreement between implementations that use `sqdmulh` and implementations that perform plain `int64` arithmetic (google/ruy#227).
So my understanding has been that the fixed-point multiplications in the TOSA spec are intended to be implementable by means of `sqdmulh`. Unfortunately, some details currently seem to prevent that, and fixing them would require untangling multiple issues scattered across the TOSA spec, the MLIR TosaToLinalg pass, and the codegen passes specializing for ARM. So the present issue is filed as an overall tracking and motivating issue for all of that.
Requirements to be able to use `sqdmulh`
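To make the requirements concrete, here is a minimal Python sketch that emulates the 32-bit `sqdmulh` instruction next to the plain `int64` formula `(a * b) >> 31`. It assumes the usual `sqdmulh` semantics, saturate((2*a*b) >> 32), and the helper names `saturate_int32`, `sqdmulh` and `int64_reference` are hypothetical, chosen for illustration; this is not code from ruy, TFLite or LLVM.

```python
# Emulates the ARM NEON 32-bit SQDMULH instruction on Python ints and compares
# it with the plain int64 fixed-point formula ((a * b) >> 31).
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def saturate_int32(x: int) -> int:
    """Clamp an arbitrary-precision int to the int32 range."""
    return max(INT32_MIN, min(INT32_MAX, x))


def sqdmulh(a: int, b: int) -> int:
    """Saturating doubling multiply, high half: sat((2*a*b) >> 32)."""
    return saturate_int32((2 * a * b) >> 32)


def int64_reference(a: int, b: int) -> int:
    """Plain 64-bit multiply with an arithmetic right shift by 31, no saturation."""
    return (a * b) >> 31


# Away from the corner case, both formulations agree bit-exactly.
for a, b in [(12345, 67890), (-(1 << 30), 1 << 30), (INT32_MAX, INT32_MIN), (-7, 3)]:
    assert sqdmulh(a, b) == int64_reference(a, b)

# The single exception: INT32_MIN * INT32_MIN. The int64 result is 2^31, which
# does not fit in int32, and sqdmulh saturates it to INT32_MAX (off by 1).
print(int64_reference(INT32_MIN, INT32_MIN))  # 2147483648
print(sqdmulh(INT32_MIN, INT32_MIN))          # 2147483647
```

Since (2*a*b) >> 32 equals (a*b) >> 31 exactly, saturation is the only place where the two can differ, and that only happens when both operands are `INT32_MIN`.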
In order to be implementable as a `sqdmulh`, a fixed-point multiplication needs to satisfy 2 criteria:
1. The `int64` value `((a * b) >> shift)` must be in `int32` range, with the only exception being the saturating case where both `a` and `b` are `INT32_MIN`.
2. When both `a` and `b` are `INT32_MIN`, the result must be allowed to be off by 1 (saturating `2^31` to `INT32_MAX`).
Getting TOSA's `apply_scale_32` to meet the requirements to use `sqdmulh`
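As a rough check of how `apply_scale_32` can map onto these instructions, the sketch below models it at `shift = 31` and compares it with an emulated `sqrdmulh`, the rounding variant that this issue proposes generating. It assumes `apply_scale_32` computes `(value * multiplier + (1 << (shift - 1))) >> shift` in 64 bits and that `value` lies in `[-2^(shift-1), 2^(shift-1))`, following the REQUIRE on `value` discussed in @dominicsymes' comment; both are assumptions of this sketch, not quotes from the spec. `apply_scale_32_model` and `sqrdmulh` are hypothetical Python helpers.

```python
# Hypothetical model: apply_scale_32 at shift == 31 versus an emulated SQRDMULH.
import random

INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def saturate_int32(x: int) -> int:
    return max(INT32_MIN, min(INT32_MAX, x))


def apply_scale_32_model(value: int, multiplier: int, shift: int) -> int:
    # Assumed semantics: 64-bit product, round half up, arithmetic right shift.
    assert multiplier >= 0, "mirrors REQUIRE(multiplier>=0)"
    assert 2 <= shift <= 62, "assumed shift range"
    assert -(1 << (shift - 1)) <= value < (1 << (shift - 1)), "assumed REQUIRE on value"
    return (value * multiplier + (1 << (shift - 1))) >> shift


def sqrdmulh(a: int, b: int) -> int:
    # Emulated SQRDMULH: sat((2*a*b + 2^31) >> 32).
    return saturate_int32((2 * a * b + (1 << 31)) >> 32)


rng = random.Random(0)
for _ in range(10_000):
    value = rng.randint(-(1 << 30), (1 << 30) - 1)  # assumed value range for shift == 31
    multiplier = rng.randint(0, INT32_MAX)          # never INT32_MIN
    assert apply_scale_32_model(value, multiplier, 31) == sqrdmulh(value, multiplier)

# The only saturating input of sqrdmulh is excluded by multiplier >= 0.
print(sqrdmulh(INT32_MIN, INT32_MIN))  # 2147483647
```

Because (2*a*b + 2^31) >> 32 equals (a*b + 2^30) >> 31, the two agree bit-exactly whenever `sqrdmulh` does not saturate, and the non-negative multiplier rules saturation out.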
Meeting the no-`INT32_MIN` requirement
Crucially, `apply_scale_32` has a `REQUIRE(multiplier>=0)` statement that ensures that the multiplier operand is never `INT32_MIN`, so requirement 2 above is automatically met. In fact, I guess that this was the motivation for that `REQUIRE(multiplier>=0)` statement.
One note though: the `REQUIRE(multiplier>=0)` information is lost in the TosaToArith pass. Either one needs to find a way for TosaToArith to preserve that information, or one needs an explicit lowering to the `sqrdmulh` intrinsic before TosaToArith.
Meeting the no-overflowing-int32 requirement
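The left-shift approach discussed in this section rests on a simple identity: for `2 <= shift <= 31`, pre-shifting `value` left by `(31 - shift)` and then applying a shift-31 rounding multiply reproduces `(value * multiplier + (1 << (shift - 1))) >> shift` exactly, as long as the pre-shift does not overflow `int32`. This is a sketch under that assumed rounding form; `reference_i64`, `rescale_via_sqrdmulh` and `sqrdmulh` are hypothetical helpers for illustration, not the actual TosaToArith lowering.

```python
# Hypothetical sketch: lowering a shift < 31 rescale to a shift-31 rounding
# multiply (SQRDMULH-style) by pre-shifting the int32 value.
import random

INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def saturate_int32(x: int) -> int:
    return max(INT32_MIN, min(INT32_MAX, x))


def sqrdmulh(a: int, b: int) -> int:
    # Emulated SQRDMULH: sat((2*a*b + 2^31) >> 32), i.e. (a*b) / 2^31 rounded half up.
    return saturate_int32((2 * a * b + (1 << 31)) >> 32)


def reference_i64(value: int, multiplier: int, shift: int) -> int:
    # Materializes the 64-bit product (assumed round-half-up rescale).
    return (value * multiplier + (1 << (shift - 1))) >> shift


def rescale_via_sqrdmulh(value: int, multiplier: int, shift: int) -> int:
    # Valid for 2 <= shift <= 31 and multiplier >= 0 (hypothetical lowering).
    pre = value << (31 - shift)
    # This left shift is the step that can overflow int32 in general.
    if not INT32_MIN <= pre <= INT32_MAX:
        raise OverflowError(f"{value} << {31 - shift} overflows int32")
    return sqrdmulh(pre, multiplier)


rng = random.Random(0)
for shift in range(2, 32):
    for _ in range(200):
        # Keep value inside [-2^(shift-1), 2^(shift-1)), the assumed REQUIRE range.
        value = rng.randint(-(1 << (shift - 1)), (1 << (shift - 1)) - 1)
        multiplier = rng.randint(0, INT32_MAX)
        assert rescale_via_sqrdmulh(value, multiplier, shift) == reference_i64(value, multiplier, shift)

# Outside that range the pre-shift overflows (here 2^20 << 23 = 2^43).
try:
    rescale_via_sqrdmulh(1 << 20, 1 << 30, 8)
except OverflowError as e:
    print("overflow:", e)
```

With `value` restricted to the range mentioned in @dominicsymes' comment, the pre-shifted value stays within `[-2^30, 2^30)`, so the overflow check never fires; the final call shows what happens when that assumption is broken, which is the case a TFLite-style lowering simply assumes away.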
In practice, in at least 90% of use cases the shift amount is a compile-time constant that is at least 31. That automatically ensures that `((a * b) >> shift)` is in `int32` range.
So we should have a pattern identifying that case and generating a `sqrdmulh` intrinsic for it.
For the remaining cases, where the shift amount either isn't known at compile time or is less than 31, conformance with the current TOSA spec requires materializing `i64` products. A currently non-conformant approach (mirroring what TFLite backends do) would be to left-shift the pre-rescale `int32` values by `(31-shift)` before passing them to `sqdmulh`. That left shift could overflow, and the looseness here lies in assuming that the overflow won't actually happen.
Maybe an attribute could be added to TOSA ops that perform an `apply_scale_32`, allowing that behavior to be specified. The TFLite-to-TOSA import pass would be able to set that attribute, since TFLite already makes that assumption.
Footnotes
¹ Secondarily, they may also be used in implementations of math functions when a fixed-point strategy is followed, but that is less essential, as there are alternatives such as dequantizing to float or using lookup tables, depending on the function at hand.