Slow Softmax at top of MobileBert/int8 profile #8974
The TOSA code here is exactly as in this LIT test, so that confirms it's a Softmax as imported from TFLite to TOSA: https://github.com/tensorflow/tensorflow/blob/f8e703259b83f23ef2115edcc2bc2eff0ead1300/tensorflow/compiler/mlir/tosa/tests/tfl-to-tosa-pipeline.mlir#L1385-L1437
If there are gather ops (maybe from tosa.table?), the kernel won't get vectorized. Do you have IR dumps for the kernel? We can extract the IR at the Linalg level; see https://github.com/google/iree/blob/main/docs/developers/debugging/integration_correctness_issue_breakdown.md#narrow-down-the-repro In short, we can get the IR after …
The above tfl-to-tosa test shows that the …
Just understood: the degree-15 polynomial here is the approximation of 1/(1+x); it's how this code is performing the division in the softmax. That's why the input to this polynomial, …
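To illustrate the idea, here is a minimal float sketch with a reduced-degree polynomial (names and structure are mine, not the generated code's): rescale the denominator d into [1, 2), write it as 1 + x with x in [0, 1), and stand in a polynomial in x for the division.

```cpp
#include <cmath>

// Hedged sketch, degree reduced for brevity (the real code uses degree 15).
float approx_reciprocal(float d) {             // assumes d > 0
  int exponent;
  float m = std::frexp(d, &exponent);          // d = m * 2^exponent, m in [0.5, 1)
  float x = 2.0f * m - 1.0f;                   // 1 + x = 2*m lies in [1, 2)
  float p = 1.0f - x + x * x - x * x * x;      // truncated series for 1/(1+x)
  return std::ldexp(p, 1 - exponent);          // 1/d = (1/(1+x)) * 2^(1-exponent)
}
```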
The TFLite->TOSA lowering of Softmax is generated by this code. |
small world - this bit references gemmlowp code I wrote years ago. I had above deciphered the constants 180/255 and 120/255 and hadn't bothered to simplify the fractions -- but 255 = 15*17, and simplifying away the factors of 15 gives 12/17 and 8/17; rescaling by 4 (a change of fixed-point format) then leaves the traditional 48/17 and 32/17 coefficients of Newton-Raphson division. it all makes sense now :-)
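For reference, a minimal float sketch of that classical Newton-Raphson reciprocal, with 48/17 and 32/17 as the initial-guess coefficients:

```cpp
// Hedged sketch: for d in [0.5, 1), x0 = 48/17 - (32/17)*d is the standard
// initial guess, and each iteration roughly doubles the correct bits.
float nr_reciprocal(float d) {                  // assumes 0.5f <= d < 1.0f
  float x = 48.0f / 17.0f - (32.0f / 17.0f) * d;
  for (int i = 0; i < 3; ++i) {
    x = x * (2.0f - d * x);                     // Newton step for f(x) = 1/x - d
  }
  return x;                                     // ~= 1/d
}
```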
That inspired me to try just dropping the … So there are 2 separate problems here: the exp computation with …
@bjacob I am assigning this to you for now since you are working on this AFAIK, but feel free to move it to me if you aren't.
(I initially wrote this as a reply on #9170, but it belongs better here. Context: we noticed in #9170 that Softmax just became much faster than initially reported here. This is a study of the current (faster) code and what next steps could be taken from here. While faster than before, it is still 20% slower than just dequantizing Softmax, so it's still not a high bar to beat.) Thanks @hanhanW for the investigation in #9170 (comment). The IR here is the implementation of Softmax in TOSA, generated during TFLite-to-TOSA here. Latencies for this benchmark on Moto Edge X30:
So this is now much better than before -- but just dequantizing Softmax (i.e. making it use the float Softmax implementation) would still be another 20% speedup over this. Here's how this currently looks in Tracy on Moto Edge X30:
In summary:
@rsuderman @bjacob do we need to lower …
We should defer the decision about arithmetic smarts to later stages. We don't want to do it at the TOSA -> Linalg stage. If we really want it in IREE, it can be done before fusion. We're able to vectorize arith ops, but not table lookups. Can we lower the exp op to …
That link is to an implementation of … So yes, we can just use that by dequantizing softmax, or we could have something even faster with an arithmetic impl of quantized …
The other dimension here is that these scalar table lookups are (even) more inefficient than they need to be, by a factor of > 2x, for the following reason. The code means to look up 32-bit exponential values, but apparently due to a TOSA limitation it can only perform table lookups of 16 bits each -- so it ends up performing 2 lookups of 16 bits each, plus bit ops to combine these two 16-bit values into a 32-bit value. See above disasm for … That should be an easy fix: even if the TOSA spec wants to have lookups of only 16-bit values, we could extend the MLIR TOSA dialect to support 32-bit lookups.
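A minimal sketch of the pattern being described (table names hypothetical): the 32-bit table is split into high and low 16-bit halves, each looked up separately and recombined with bit ops, where a native 32-bit lookup would be a single operation.

```cpp
#include <cstdint>

// Hedged sketch of the "two 16-bit lookups plus bit ops" pattern.
int32_t lookup32_via_two_16bit(const uint16_t* hi_half, const uint16_t* lo_half,
                               int index) {
  uint32_t hi = hi_half[index];                  // first 16-bit lookup
  uint32_t lo = lo_half[index];                  // second 16-bit lookup
  return static_cast<int32_t>((hi << 16) | lo);  // recombine into 32 bits
}
```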
Would it be possible to keep the exp as is, and when lowering to LLVM convert the exp alone to use the fp32 path (and the polynomial expansion) when it is in scalar form?
RE 16-bit tosa lookup: that's a restriction of older/smaller ISAs that could only handle 16-bit element lookups (neon, riscv which only has vrgatherei16, etc). It's a layering violation that, when 32-bit lookups are available (avx2/sve), we are still forced down that path early. Would be nice to remove that restriction in TOSA and our lowerings: we should be able to do any arbitrary-width lookup and use analysis to narrow the indices when possible. I do want to make sure we aren't establishing that code containing gathers can't be vectorized, though. On avx2 and sve we have native instructions we can use, but on older ISAs where those instructions aren't available, vectorizing everything except the unrolled non-contiguous loads used to compose vectors is still viable and much better than vectorizing nothing at all (especially when fused with truly vectorizable work) - see the sketch below. It nets out to essentially just software-level implementations of the missing instructions; the need for non-contiguous loads was not introduced with ML, and even though the instructions haven't existed, people have been doing this effectively for decades at this point. Whether in the exp case the lookups are worth it is orthogonal - I mostly just don't want "we can't vectorize gathers" to be assumed true because today we can't generate them, and going from int to float feels like a giant hack we should make sure does not become the normal approach in any case where we happen to fuse a gather with other stuff :)
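For illustration, a minimal sketch of such a software gather on NEON, which has no 32-bit gather instruction (names are mine): the non-contiguous loads are unrolled lane by lane while everything around them stays in vector registers.

```cpp
#include <arm_neon.h>

// Hedged sketch of a "software gather" of four int32 table entries.
int32x4_t gather_i32(const int32_t* table, int32x4_t indices) {
  int32x4_t r = vdupq_n_s32(0);
  r = vsetq_lane_s32(table[vgetq_lane_s32(indices, 0)], r, 0);
  r = vsetq_lane_s32(table[vgetq_lane_s32(indices, 1)], r, 1);
  r = vsetq_lane_s32(table[vgetq_lane_s32(indices, 2)], r, 2);
  r = vsetq_lane_s32(table[vgetq_lane_s32(indices, 3)], r, 3);
  return r;  // four scalar loads composed into one vector
}
```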
To add a concrete example to this, there are fusion opportunities for Softmax/Logistic and the ops that process their output, potentially bypassing most of the dequantizing or exp calculation. In vision models, Softmax is usually followed by some thresholding, e.g. discard results where the value (probability) is < 0.5. This logic can be fused with the Softmax so that only values > 0.5 (for quantized models, the quantized equivalent of 0.5) are included in the exp calculation. In practice, this eliminates the need to run exp on >80% of the values. The logic would change from running the exp path unconditionally to gating it on the threshold first, as in the sketch below.
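A minimal sketch of that fusion, assuming hypothetical names and a precomputed quantized threshold (the normalizing sum is omitted for brevity):

```cpp
#include <cmath>
#include <cstdint>

// Hedged sketch: apply the downstream threshold in the quantized domain
// first, so dequantization and exp only run on the few surviving values.
// q_threshold is the quantized equivalent of probability 0.5.
void fused_threshold_exp(const int8_t* q_in, float* out, int n,
                         int8_t q_threshold, float scale, int zero_point) {
  for (int i = 0; i < n; ++i) {
    if (q_in[i] < q_threshold) {
      out[i] = 0.0f;  // discarded: no dequant, no exp (>80% of values)
    } else {
      out[i] = std::exp(scale * (q_in[i] - zero_point));  // expensive path
    }
  }
}
```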
This is great! This kind of cross-op optimization is fair game (and probably worth doing!). I was talking more about specific handling of …
As far as I know, NEON's table-lookup instruction (TBL) only supports loading bytes, not 16-bit elements. There's a much deeper reason, though, why … So when the target is NEON, all … Given that the 8-bit/16-bit restriction of … But looking at the TOSA spec for TABLE, it's getting clearer what's going to be the real friction if we try to generalize it: TABLE is not always the simple table-lookup operation that I thought it was. It is when the … That complication with the int16-indexed case also helps explain why the disasm we're seeing here (…) … And so now that I understand that, my own take-away from this is that anything that involves … Gather-load instructions won't fix that part: gather-loads help the memory-access part, but the problem with …
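For reference, a sketch of my reading of TOSA TABLE semantics for int16 input (an interpretation of the spec pseudocode, not code from the lowering): a 513-entry table where the top 9 bits of the operand select a segment and the low 7 bits linearly interpolate within it, so there is genuine arithmetic per element, not just a load.

```cpp
#include <cstdint>

// Hedged sketch of the int16-indexed TABLE behavior per the TOSA spec.
int32_t tosa_table_int16(const int16_t table[513], int16_t value) {
  int32_t index = (static_cast<int32_t>(value) >> 7) + 256;  // 0..511
  int32_t frac = value & 0x7F;                               // low 7 bits
  int32_t base = table[index];
  int32_t next = table[index + 1];
  return (base << 7) + (next - base) * frac;  // interpolated, not a pure lookup
}
```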
Interesting! Kind of tips the scales in favor of dequantizing softmax. Tempting, as the routes to making quantized softmax less slow are not looking as easy as I hoped (previous comment).
Sorry, I am going to resurface the question: why do we need to implement a table lookup at the TOSA level? All the effort it takes to implement the table lookup at the TOSA level can be done at …
Sorry @MaheshRavishankar - I'm not 100% sure I correctly grasp your suggestion; let's discuss this in our 1:1 today.
Summary of chat with Mahesh:
Regarding preserving the div with range info: the way this is sometimes done is to use a quantized type for the element type. Not saying this is a great idea to preserve deeply into the compiler (we really want such types normalized early). Imo, if you want to go down this path, having dedicated arithmetic ops for such things would be the way to go. Regarding a new … If I were to attach a principle to this, I would say that fixed-point …
Very interesting conversation here! We've been actively following it since it offers valuable low-level implementation feedback.
Hope this helps in terms of insight into choices here. We'd be happy to continue this conversation - as we've stated, IREE is at the vanguard of TOSA codegen on CPU and offers tremendously useful feedback! |
See #8974. This is still a 20% end-to-end latency improvement on MobileBert-int8 on configs where matmuls are already reasonably fast, making other things like Softmax relatively more important. That is even after Softmax slowness was much improved recently, as observed in #9170. Moreover, discussion around #8974 suggests that the path forward for non-dequantized Softmax is nontrivial, so putting our benchmarks on the dequantized path for now will help insulate them a bit from what we expect to be in flux for the foreseeable future.
Dequantization of softmax has landed (#9337) and the expected performance improvement has materialized on the dashboard (#9337 (comment)). @sjarus thanks for the insights, I'll follow up with you separately on these ramifications. With #9337 we have merely "unblocked" ourselves, in that these topics don't block IREE getting good e2e performance anymore, but they're still important topics in their own right, and eventually they'll unlock getting softmax to perform even better than a decent dequantized impl.
Sounds great, @bjacob!
Profiling MobileBert/int8/experimental(mmt4d)/dotprod, where matmuls themselves are relatively fast, makes the rest show more prominently in profiles.
According to Tracy, ~50% of time is being spent in a dispatch that appears to be a Softmax. At least it plausibly looks like one, as it performs some table lookups, a sum-reduction, then evaluates a degree-15 polynomial approximation of a math function and multiplies that together, like a Softmax would. And we know that MobileBert contains lots of Softmax.
TOSA "source" code of the slow loop
My own hand-deciphering of that TOSA code into pseudo-C (where I got to understand that it's evaluating a degree-15 polynomial, etc).
Disassembly from Tracy:
We can see that it's scalar code, not SIMD. The x and w registers are scalar registers (64-bit and 32-bit respectively). Getting this code to vectorize properly is likely to require some measure of explicit vectorization using at least ARM NEON intrinsics. The reason is that the efficient lowering of these fixed-point multiplications depends on fine details of the target ISA. As the target is ARM NEON, the fixed-point multiplications of the form …
should be explicitly vectorized as …
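The original snippets are elided above; a plausible sketch of the mapping being described, assuming the usual saturating rounding doubling high multiply:

```cpp
#include <arm_neon.h>

// Scalar form, roughly: (2 * (int64_t)a * b + (1 << 30)) >> 31, with
// saturation. On NEON this is a single instruction across four lanes.
int32x4_t fixed_point_mul(int32x4_t a, int32x4_t b) {
  return vqrdmulhq_s32(a, b);  // SQRDMULH
}
```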