Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apply_scale_32 not lowering to ARM NEON sqdmulh instructions #9109

Open
bjacob opened this issue May 12, 2022 · 6 comments
Open

apply_scale_32 not lowering to ARM NEON sqdmulh instructions #9109

bjacob opened this issue May 12, 2022 · 6 comments
Assignees
Labels
codegen/llvm LLVM code generation compiler backend integrations/tosa Tensor Operator Set Architecture (TOSA) import, tests, etc. performance ⚡ Performance/optimization related work across the compiler and runtime

Comments

@bjacob
Copy link
Contributor

bjacob commented May 12, 2022

Preamble

ARM NEON has fixed-point multiplication instructions, like sqdmulh. Existing NN inference solutions (TFLite, ruy, XNNPACK) use them. It's not possible to match their performance in quantized workloads without using them. These instructions are important primarily1 in quantization rescalings of internal int32 accumulators at the end of each quantized layer.

ARM has already successfully invested effort in specifying rescalings in a way that allows bit-exact agreement across implementations that use sqdmulh and implementations that perform plain int64 arithmetic (google/ruy#227).

So my understanding has been that it is the intent that the fixed-point multiplications in the TOSA spec are implementable by means of sqdmulh. Unfortunately it seems that some details are currently preventing that from happening, and that fixing that would require untangling multipl issues scattered across: the TOSA spec, the MLIR TosaToLinalg pass, and codegen passes specializing to ARM codegen. So the present issue is filed as an overall tracking, motivating issue for all that.

Requirements to be able to use sqdmulh

In order to be implementable as a sqdmulh, a fixed-point multiplication needs to satisfy 2 criteria:

  1. The int64 value ((a * b) >> shift) must be in int32 range, with the only exception being saturating when both a and b are INT32_MIN.
  2. In the case where both operands are INT32_MIN, the result must be allowed to be off by 1 (saturating 2^31 to INT32_MAX).

Getting TOSA's apply_scale_32 to meet the requirements to use sqdmulh

Meeting the no-INT32_MIN requirement

Crucially, apply_scale_32 has a REQUIRE(multiplier>=0) statement that ensures that the multiplier operand is never INT32_MIN, so the above requirement 2 is automatically met. In fact, I guess that that was the motivation for that REQUIRE(multiplier>=0) statement.

One note though: that REQUIRE(multiplier>=0) information is lost in the TosaToArith pass. Either one needs to find a way for TosaToArith to preserve that information, or one needs to have explicit lowering to sqrdmulh intrinsic before TosaToArith.

Meeting the no-overflowing-int32 requirement

In practice it is the case in at least 90% of use cases that the shift amount is a compile-time constant and is at least 31. That automatically ensures that ((a * b) >> shift) is in int32 range.

So we should have a pattern identifying that case and generating a sqrdmulh intrinsic for that case.

For the remaining cases where the shift amount either isn't known at compile time, or is less than 31, conformance with the current TOSA spec requires materializing i64 products. A currently non-conformant approach (mirroring what TFLite backends do) would be to left-shift the pre-rescale int32 values by (31-shift), before passing that to sqdmulh. That left-shift could overflow, and the looseness here is in assuming that the overflow won't actually happen.

Maybe an attribute could be added to TOSA ops that perform an apply_scale_32, allowing to specify that behavior. The TFLite-to-TOSA import pass would be able to set that attribute since TFLite is already making that assumption.

Footnotes

  1. Secondarily, they may also be used in implementations of math functions, when a fixed-point strategy is followed, but that's less essential as there are alternatives, such as dequantizing to float or lookup tables, depending on the function at hand.

@bjacob bjacob added performance ⚡ Performance/optimization related work across the compiler and runtime codegen/llvm LLVM code generation compiler backend integrations/tosa Tensor Operator Set Architecture (TOSA) import, tests, etc. labels May 12, 2022
@bjacob bjacob changed the title tosa.mul {shift=31} and apply_scale_32 are not lowering to ARM NEON sqdmulh instructions apply_scale_32 not lowering to ARM NEON sqdmulh instructions May 12, 2022
@bjacob bjacob self-assigned this Jun 16, 2022
@GMNGeoffrey GMNGeoffrey added this to IREE Jun 27, 2022
@rsuderman
Copy link
Contributor

@bjacob This should be fixed for given we have a 32-bit version of the lowering now. llvm/llvm-project@9294a1e

@bjacob
Copy link
Contributor Author

bjacob commented Jul 21, 2022

does 'fixed' mean 'actually generate a fixed-point mul instruction like sqdmulh' or some intermediate goal?

@rsuderman
Copy link
Contributor

At this point we generate almost purely i32 operations that should pass through the vectorization code. The only section that is not i32 attempts to "encourage" the compile towards sqdmulh. Is there a good way to see whether the backend compiler actually pattern matches the behavior?

https://github.com/llvm/llvm-project/blob/main/mlir/test/Conversion/TosaToArith/tosa-to-arith.mlir#L29

If not we would need to have some intrinsic we can insert instead of the minimal i64 code above.

@bjacob
Copy link
Contributor Author

bjacob commented Jul 22, 2022

Before going into the backend to see whether it uses the opportunity to generate a sqdmulh, we need to make sure that we actually create that opportunity. The requirements are outlined in this PR's description > "Getting TOSA's apply_scale_32 to meet the requirements to use sqdmulh".

In particular, the IR going into the backend must convey the information that at least one of the operands is not INT32_MIN. As explained in the PR description above, that is implied by the tosa.apply_scale having a REQUIRE(multiplier>=0), but that information is lost in TosaToArith. The above tosa-to-arith.mlir link in the previous comment confirms that information is lost: there is no way for the compiler backend to know that both sides can't simultaneously be INT32_MIN, making it legal to compile this as a sqdmulh.

The other part as explained in the above PR description is that tosa.apply_scale_32 allows the shift argument to be as low as 2. The sqdmulh instruction can be used only if shift>=31, and that needs to be statically known.

@dominicsymes
Copy link

Hi @bjacob. You mention in the “meeting the no-overflowing-int32 requirement” of the issue statement that a left shift of (31-shift) on the value to be scaled could overflow an int32. I think the third line of the TOSA spec for apply_scale_32 is relevant to this. This REQUIRE statement requires “value” to be in a range for which left shift of up to (32-shift) does not overflow (for shift<32).

@bjacob
Copy link
Contributor Author

bjacob commented Jul 22, 2022

Thanks @dominicsymes , I had missed this!

So this leaves only the other issue, of avoiding the INT32_MIN * INT32_MIN case which would overflow / where sqdmulh saturates?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
codegen/llvm LLVM code generation compiler backend integrations/tosa Tensor Operator Set Architecture (TOSA) import, tests, etc. performance ⚡ Performance/optimization related work across the compiler and runtime
Projects
No open projects
Status: No status
Development

No branches or pull requests

3 participants