`apply_scale_32` not lowering to ARM NEON `sqdmulh` instructions #9109

bjacob · 2022-05-12T20:57:53Z

Preamble

ARM NEON has fixed-point multiplication instructions, like sqdmulh. Existing NN inference solutions (TFLite, ruy, XNNPACK) use them. It's not possible to match their performance in quantized workloads without using them. These instructions are important primarily¹ in quantization rescalings of internal int32 accumulators at the end of each quantized layer.

ARM has already successfully invested effort in specifying rescalings in a way that allows bit-exact agreement across implementations that use sqdmulh and implementations that perform plain int64 arithmetic (google/ruy#227).

So my understanding has been that it is the intent that the fixed-point multiplications in the TOSA spec are implementable by means of sqdmulh. Unfortunately it seems that some details are currently preventing that from happening, and that fixing that would require untangling multipl issues scattered across: the TOSA spec, the MLIR TosaToLinalg pass, and codegen passes specializing to ARM codegen. So the present issue is filed as an overall tracking, motivating issue for all that.

Requirements to be able to use `sqdmulh`

In order to be implementable as a sqdmulh, a fixed-point multiplication needs to satisfy 2 criteria:

The int64 value ((a * b) >> shift) must be in int32 range, with the only exception being saturating when both a and b are INT32_MIN.
In the case where both operands are INT32_MIN, the result must be allowed to be off by 1 (saturating 2^31 to INT32_MAX).

Getting TOSA's `apply_scale_32` to meet the requirements to use `sqdmulh`

Meeting the no-`INT32_MIN` requirement

Crucially, apply_scale_32 has a REQUIRE(multiplier>=0) statement that ensures that the multiplier operand is never INT32_MIN, so the above requirement 2 is automatically met. In fact, I guess that that was the motivation for that REQUIRE(multiplier>=0) statement.

One note though: that REQUIRE(multiplier>=0) information is lost in the TosaToArith pass. Either one needs to find a way for TosaToArith to preserve that information, or one needs to have explicit lowering to sqrdmulh intrinsic before TosaToArith.

Meeting the no-overflowing-int32 requirement

In practice it is the case in at least 90% of use cases that the shift amount is a compile-time constant and is at least 31. That automatically ensures that ((a * b) >> shift) is in int32 range.

So we should have a pattern identifying that case and generating a sqrdmulh intrinsic for that case.

For the remaining cases where the shift amount either isn't known at compile time, or is less than 31, conformance with the current TOSA spec requires materializing i64 products. A currently non-conformant approach (mirroring what TFLite backends do) would be to left-shift the pre-rescale int32 values by (31-shift), before passing that to sqdmulh. That left-shift could overflow, and the looseness here is in assuming that the overflow won't actually happen.

Maybe an attribute could be added to TOSA ops that perform an apply_scale_32, allowing to specify that behavior. The TFLite-to-TOSA import pass would be able to set that attribute since TFLite is already making that assumption.

Secondarily, they may also be used in implementations of math functions, when a fixed-point strategy is followed, but that's less essential as there are alternatives, such as dequantizing to float or lookup tables, depending on the function at hand. ↩

The text was updated successfully, but these errors were encountered:

rsuderman · 2022-07-20T22:56:15Z

@bjacob This should be fixed for given we have a 32-bit version of the lowering now. llvm/llvm-project@9294a1e

bjacob · 2022-07-21T00:04:04Z

does 'fixed' mean 'actually generate a fixed-point mul instruction like sqdmulh' or some intermediate goal?

rsuderman · 2022-07-21T23:04:04Z

At this point we generate almost purely i32 operations that should pass through the vectorization code. The only section that is not i32 attempts to "encourage" the compile towards sqdmulh. Is there a good way to see whether the backend compiler actually pattern matches the behavior?

https://github.com/llvm/llvm-project/blob/main/mlir/test/Conversion/TosaToArith/tosa-to-arith.mlir#L29

If not we would need to have some intrinsic we can insert instead of the minimal i64 code above.

bjacob · 2022-07-22T01:20:29Z

Before going into the backend to see whether it uses the opportunity to generate a sqdmulh, we need to make sure that we actually create that opportunity. The requirements are outlined in this PR's description > "Getting TOSA's apply_scale_32 to meet the requirements to use sqdmulh".

In particular, the IR going into the backend must convey the information that at least one of the operands is not INT32_MIN. As explained in the PR description above, that is implied by the tosa.apply_scale having a REQUIRE(multiplier>=0), but that information is lost in TosaToArith. The above tosa-to-arith.mlir link in the previous comment confirms that information is lost: there is no way for the compiler backend to know that both sides can't simultaneously be INT32_MIN, making it legal to compile this as a sqdmulh.

The other part as explained in the above PR description is that tosa.apply_scale_32 allows the shift argument to be as low as 2. The sqdmulh instruction can be used only if shift>=31, and that needs to be statically known.

dominicsymes · 2022-07-22T17:03:18Z

Hi @bjacob. You mention in the “meeting the no-overflowing-int32 requirement” of the issue statement that a left shift of (31-shift) on the value to be scaled could overflow an int32. I think the third line of the TOSA spec for apply_scale_32 is relevant to this. This REQUIRE statement requires “value” to be in a range for which left shift of up to (32-shift) does not overflow (for shift<32).

bjacob · 2022-07-22T17:38:19Z

Thanks @dominicsymes , I had missed this!

So this leaves only the other issue, of avoiding the INT32_MIN * INT32_MIN case which would overflow / where sqdmulh saturates?

bjacob added performance ⚡ Performance/optimization related work across the compiler and runtime codegen/llvm LLVM code generation compiler backend integrations/tosa Tensor Operator Set Architecture (TOSA) import, tests, etc. labels May 12, 2022

bjacob added this to IREE Codegen May 12, 2022

bjacob mentioned this issue May 12, 2022

ARM NEON codegen for tosa.apply_scale naively does i64 arithmetic, should use i32 fixed-point instructions #9105

Closed

bjacob changed the title ~~tosa.mul {shift=31} and apply_scale_32 are not lowering to ARM NEON sqdmulh instructions~~ apply_scale_32 not lowering to ARM NEON sqdmulh instructions May 12, 2022

allieculp added this to IREE - ARCHIVED (do not update) May 17, 2022

This was referenced May 20, 2022

Root cause source of massive latency improvement in quantized MobileBertSquad benchmarks #9170

Closed

Slow Softmax at top of MobileBert/int8 profile #8974

Closed

bjacob self-assigned this Jun 16, 2022

GMNGeoffrey added this to IREE Jun 27, 2022

GMNGeoffrey added this to (Deprecated) IREE Feb 21, 2023

julianwa moved this to Backlog in (Deprecated) IREE Apr 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`apply_scale_32` not lowering to ARM NEON `sqdmulh` instructions #9109

`apply_scale_32` not lowering to ARM NEON `sqdmulh` instructions #9109

bjacob commented May 12, 2022

rsuderman commented Jul 20, 2022

bjacob commented Jul 21, 2022

rsuderman commented Jul 21, 2022

bjacob commented Jul 22, 2022

dominicsymes commented Jul 22, 2022

bjacob commented Jul 22, 2022

apply_scale_32 not lowering to ARM NEON sqdmulh instructions #9109

apply_scale_32 not lowering to ARM NEON sqdmulh instructions #9109

Comments

bjacob commented May 12, 2022

Preamble

Requirements to be able to use sqdmulh

Getting TOSA's apply_scale_32 to meet the requirements to use sqdmulh

Meeting the no-INT32_MIN requirement

Meeting the no-overflowing-int32 requirement

Footnotes

rsuderman commented Jul 20, 2022

bjacob commented Jul 21, 2022

rsuderman commented Jul 21, 2022

bjacob commented Jul 22, 2022

dominicsymes commented Jul 22, 2022

bjacob commented Jul 22, 2022

`apply_scale_32` not lowering to ARM NEON `sqdmulh` instructions #9109

`apply_scale_32` not lowering to ARM NEON `sqdmulh` instructions #9109

Requirements to be able to use `sqdmulh`

Getting TOSA's `apply_scale_32` to meet the requirements to use `sqdmulh`

Meeting the no-`INT32_MIN` requirement