Implementation-dependent reciprocal [sqrt] approximation instructions #4
My main objection to doing this early is that the results differ per platform, which WebAssembly has tried very hard to avoid. These instructions have a history of providing great perf when precision doesn't matter, so I see them as somewhat special. I think we need to quantify these gains across "tier 1" ISAs, in the context where they're usually employed. We then need to also document what kind of results they return in different implementations so that our spec can provide some bounds (cf. JavaScript's original sin / cos for how not to do this).
Fortunately, these functions are better behaved than sin/cos, so it should be easy to specify a quite tight maximum relative error along with an allowance for weird results near 0. For example, Intel promises |Relative Error| ≤ 1.5 × 2^−12, with subnormal inputs mapped to ±∞. There should also be less room for software shenanigans, since straight up computing 1/x or 1/sqrt(x) in one or two instructions is much faster than computing sin(x).
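For context, that bound works out to a bit more than 11 guaranteed bits of accuracy; a quick back-of-the-envelope check (plain Python, nothing hardware-specific):

```python
import math

# Intel's documented bound for RCPPS/RSQRTPS: |relative error| <= 1.5 * 2**-12
bound = 1.5 * 2 ** -12
print(bound)              # ~3.662e-04
print(-math.log2(bound))  # ~11.4 guaranteed bits of accuracy
```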
A concern I have is that f64x2 is not supported well on platforms other than Intel. Implementations would be forced to lower these opcodes to the equivalent scalar ops, probably leading to zero or negative performance improvement. So I'm wondering if the f64x2 type should be included.
@billbudge (hi Bill!) is this specifically about f64x2 approximation instructions, or about f64x2 in general?
Hi JF. Yes, f64x2 in general. We think it will be hard enough to get clear performance wins with f32x4, so somewhat skeptical about f64x2, especially on platforms like ARM. |
In ARMv8, A64 supports f64x2. I'm happy with this, but I agree that this type is related to the "portable performance" bar I suggested. I think it's a separate issue from approximation instructions though. Could you file it?
To be debated in #3. Add a complete list of omitted operations to Overview.md.
Fold the portable-simd.md specification and the WebAssembly mapping into a single SIMD.md document that is specific to WebAssembly. This is easier to read and avoids a lot of confusion.
- Don't attempt to specify floating-point semantics again. Just refer to existing WebAssembly behavior.
- Remove the minNum and maxNum operations, which were never intended to be included in WebAssembly.
- Clarify the trapping behavior of the float-to-int conversions.
- Remove the descriptions of reciprocal [sqrt] approximations. See #3.
- Rename all operations to use snake_case.
- Add *.const instructions for all the new types.
- Remove the partial load and store operators, as suggested by @sunfishcode. They were removed from asm.js too.
As long as the constraint about the reproducibility of the results across clients is not softened, this question has absolutely no answer. But if this constraint disappears, then I think the best way would be to have something like this:
From the ISAs with reciprocal square-root estimate, which one has the largest error? Some (most?) ISAs also provide instructions that perform single Newton-Raphson iterations for this operation, so an alternative here would be to just specify an upper bound on the error, use the reciprocal square-root instruction of the ISA when it satisfies the upper bound, and add one or more NR iterations to that result as required in ISAs where this is not the case.
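For reference, the refinement in question is the textbook Newton-Raphson step for y ≈ 1/sqrt(x). A scalar sketch in Python (illustrative only; the crude starting value stands in for a hardware estimate instruction):

```python
import math

def rsqrt_nr_step(x, y):
    """One Newton-Raphson refinement of y ~ 1/sqrt(x):
    y' = y * (3 - x*y*y) / 2. Roughly doubles the accurate bits."""
    return y * (1.5 - 0.5 * x * y * y)

x = 2.0
y = 0.7  # crude initial estimate, standing in for a hardware rsqrte result
for _ in range(3):
    y = rsqrt_nr_step(x, y)
print(abs(y - 1 / math.sqrt(2.0)))  # shrinks rapidly toward 0
```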
I would say Neon, but I don't remember well. Most documents I have found don't specify the precision of this instruction, and some give a maximum error of 1/4096 (12 bits), which would be the same as x86, but in my experience it's less accurate than Intel's: either ARM is less accurate than 12 bits, or Intel is more accurate...
Only Neon supports such an instruction. With other ISAs, you need to implement the NR iteration yourself with a bunch of MUL/FMA.
I tend to think that's the way to do it. Some extra thoughts: AVX-512 now has instructions that are more accurate for those operations: 14 and 28 bits.
Do you know the name of the VSX instruction that supports 64-bit rsqrte on PowerPC? The 32-bit instruction also has an error of 1/4096, like the SSE instructions. MIPS MSA supports 32-bit and 64-bit reciprocal square-root estimate with an error of at most 2 ULPs. Also, worst case, one can implement these by doing a full vector square root, which most ISAs support, followed by a divide. That's slow, but would still be a conforming implementation.
It seems that PowerPC implements this for 64-bit floats with 14-bit precision. So 12-bit precision for 32-bit and 14-bit precision for 64-bit look like "reasonable" upper bounds on the error to me.
On 64 bits, there is
I never used MIPS, so you're telling me.
I have to disagree with your conclusion: one way to implement the fast reciprocal for F64 when you don't have any instruction is to convert your input to F32, call the fast reciprocal on F32, and convert back to F64. Actually, that's why I would recommend that the precision not be fixed: the instruction could take an immediate to specify the required accuracy, and when the assembly is generated, the fast reciprocal instruction will be used and enough NR iterations will be generated to achieve the required accuracy on the actual hardware. So users should not write NR themselves; they just specify that they want, for instance, a 20-bit accurate result, and that's it.
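A sketch of how an engine could pick the iteration count from such an immediate, assuming the rule of thumb that each NR step roughly doubles the number of accurate bits (the function name is made up for illustration):

```python
def nr_steps_needed(start_bits, target_bits):
    """Number of Newton-Raphson steps to refine a hardware estimate with
    start_bits of accuracy up to at least target_bits, assuming each step
    roughly doubles the accurate bits. Chosen once at compile time."""
    steps, bits = 0, start_bits
    while bits < target_bits:
        bits *= 2
        steps += 1
    return steps

# e.g. a hardware estimate good to 12 bits, user asked for 20 bits:
print(nr_steps_needed(12, 20))  # -> 1
print(nr_steps_needed(8, 24))   # -> 2  (8 -> 16 -> 32)
```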
I totally agree with you on that point. By the way, if the target precision is too high, that implementation can be the fastest one possible.
While I think this is what some ISAs actually do (does PowerPC 64 do this?), the largest representable
This is actually a really interesting idea.
I think making them an immediate is reasonable: those who want run-time behavior can just branch on it (e.g. have a switch that calls the intrinsic with different immediates); that's what the compiler would have to generate if it were to accept a run-time argument anyway (although the compiler might be able to be a bit more clever about the switch, since it might know for a particular target which options are available).
I never said that was the only way to implement it. It's true that this specific implementation does not support inputs outside the range of F32. Another way to implement it is with some bit hacks: https://en.wikipedia.org/wiki/Fast_inverse_square_root
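For illustration, a minimal sketch of that bit hack in Python (the magic constant and single refinement step follow the Wikipedia description; this is not how an engine would actually lower the instruction):

```python
import struct

def fast_inv_sqrt(x):
    """The classic Quake III trick: reinterpret the float's bits as an
    integer, shift and subtract from a magic constant to get a rough
    estimate of 1/sqrt(x), then refine with one Newton-Raphson step."""
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5f3759df - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    return y * (1.5 - 0.5 * x * y * y)  # one NR refinement

print(fast_inv_sqrt(4.0))  # close to 0.5, within ~0.2%
```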
Agreed. I don't see any valid use case for a runtime precision requirement anyway.
This is an interesting area to explore. Implementing the NR step seems simple enough on non-NEON backends. I would need some help figuring out how many steps of NR to perform to achieve the specified accuracy: from https://www.felixcloutier.com/x86/rcpps, the relative error is at most 1.5 × 2^−12. (Sorry, I know nothing about numerical analysis; any pointers here will be helpful.)
Here is an old unpublished paper of mine. While its main topic is not related to reciprocal (sqrt) approximation operations, it does contain probably the best description and semi-analytical model of the subject in Section 4.
As discussed in https://github.com/WebAssembly/meetings/blob/main/simd/2022/SIMD-02-18.md#aob (search for reciprocal), this is likely out of scope; adding a label to indicate so.
These were originally proposed as a part of the fixed-width SIMD proposal, and were then migrated to the relaxed-simd proposal, which also deems these operations out of scope.

Github issue: WebAssembly/relaxed-simd#4
Bug: v8:12284
Change-Id: I65ceb6dfd25c43cf49bd7ec5b5ecd6b32cc3516a
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/3595970
Reviewed-by: Thibaud Michaud <[email protected]>
Commit-Queue: Deepti Gandluri <[email protected]>
Cr-Commit-Position: refs/heads/main@{#80125}
The proposal in WebAssembly/simd#1 includes these instructions:

- f32x4.reciprocalApproximation(a: v128) -> v128
- f32x4.reciprocalSqrtApproximation(a: v128) -> v128
- f64x2.reciprocalApproximation(a: v128) -> v128
- f64x2.reciprocalSqrtApproximation(a: v128) -> v128

The corresponding scalar instructions are mentioned in the future features design document.
These instructions are available in the ARM, Intel, and MIPS SIMD instruction sets. They are typically implemented as table-driven piece-wise linear approximations on the interval [0.5; 1) or [0.25; 1) and then extended to the full exponent domain. However, the exact nature of the approximation is implementation-dependent, so different ISAs will get different results.
The approximations are either used as-is for low-precision graphics or physics code, or they are used as starting points and refined with one or more steps of Newton-Raphson.
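As a toy model of the table-driven scheme described above (purely illustrative; real hardware tables, indexing, and special-case handling differ per ISA, which is exactly why the results diverge):

```python
import math

TABLE_BITS = 7  # a 128-entry table, similar in spirit to hardware estimates

# Precompute reciprocals at the midpoint of each mantissa interval in [1, 2).
TABLE = [1.0 / (1.0 + (i + 0.5) / (1 << TABLE_BITS))
         for i in range(1 << TABLE_BITS)]

def recip_estimate(x):
    """Table-driven reciprocal estimate for positive finite x: index the
    table by the top mantissa bits, handle the exponent separately."""
    m, e = math.frexp(x)   # x = m * 2**e with m in [0.5, 1)
    m *= 2.0
    e -= 1                 # renormalize so m is in [1, 2)
    idx = int((m - 1.0) * (1 << TABLE_BITS))
    return TABLE[idx] * 2.0 ** -e

x = 3.7
est = recip_estimate(x)
print(abs(est - 1 / x) / (1 / x))  # relative error around 2**-8 or better
```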