
Implementation-dependent reciprocal [sqrt] approximation instructions #4

stoklund opened this issue Apr 14, 2017 · 17 comments
Labels
instruction-proposal, out of scope (Out of scope for this proposal, can be explored in a subsequent proposal.)

Comments

@stoklund

The proposal in WebAssembly/simd#1 includes these instructions:

  • f32x4.reciprocalApproximation(a: v128) -> v128,
  • f32x4.reciprocalSqrtApproximation(a: v128) -> v128,
  • f64x2.reciprocalApproximation(a: v128) -> v128, and
  • f64x2.reciprocalSqrtApproximation(a: v128) -> v128.

The corresponding scalar instructions are mentioned in the future features design document.

These instructions are available in the ARM, Intel, and MIPS SIMD instruction sets. They are typically implemented as table-driven piece-wise linear approximations on the interval [0.5; 1) or [0.25; 1) and then extended to the full exponent domain. However, the exact nature of the approximation is implementation-dependent, so different ISAs will get different results.

The approximations are either used as-is for low-precision graphics or physics code, or they are used as starting points and refined with one or more steps of Newton-Raphson.
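For readers unfamiliar with the refinement step, here is a minimal sketch using SSE intrinsics (illustrative only, not part of the proposal): it takes the hardware estimate and applies one Newton-Raphson iteration, roughly doubling the number of correct bits.

```c
#include <xmmintrin.h>

/* Illustrative only: refine the ~12-bit hardware estimate of 1/sqrt(x)
 * with one Newton-Raphson step, giving roughly 22-23 correct bits. */
static __m128 rsqrt_one_nr_step(__m128 x) {
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 e   = _mm_rsqrt_ps(x);                     /* table-driven estimate */
    __m128 xee = _mm_mul_ps(_mm_mul_ps(x, e), e);     /* x * e * e */
    /* e' = 0.5 * e * (3 - x * e * e) */
    return _mm_mul_ps(_mm_mul_ps(half, e), _mm_sub_ps(three, xee));
}
```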

@jfbastien
Member

My main objection to doing this early is that the results differ per platform, which WebAssembly has tried very hard to avoid. These instructions have a history of providing great perf when precision doesn't matter, so I see them as somewhat special.

I think we need to quantify these gains across "tier 1" ISAs, in the context where they're usually employed. We then need to also document what kind of results they return in different implementations so that our spec can provide some bounds (cf. JavaScript's original sin / cos for how not to do this).

@stoklund
Author

Fortunately, these functions are better behaved than sin/cos, so it should be possible to specify quite a tight maximum relative error along with an allowance for weird results near 0. For example, Intel promises |Relative Error| ≤ 1.5 ∗ 2^−12, with subnormal inputs mapped to ±∞.
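For a rough sense of scale, 1.5 ∗ 2^−12 ≈ 3.7 ∗ 10^−4, which corresponds to roughly 11–12 correct bits out of the 24-bit f32 significand.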

There should also be less room for software shenanigans since straight up computing 1/x or 1/sqrt(x) in one or two instructions is much faster than computing sin(x).

@billbudge

A concern I have is that f64x2 is not supported well on platforms other than Intel. Implementations would be forced to lower these opcodes to the equivalent scalar ops, probably leading to zero or negative performance improvement. So I'm wondering if the f64x2 type should be included.

@jfbastien
Member

@billbudge (hi Bill!) is this specifically about f64x2 approximation instructions, or about f64x2 in general?

@billbudge

Hi JF. Yes, f64x2 in general. We think it will be hard enough to get clear performance wins with f32x4, so we're somewhat skeptical about f64x2, especially on platforms like ARM.

@jfbastien
Member

In ARMv8, A64 supports f64x2. I'm happy with this, but I agree that this type is related to the "portable performance" bar I suggested.

I think it's a separate issue from approximation instructions though. Could you file it?

stoklund referenced this issue in WebAssembly/simd Apr 25, 2017

Add a complete list of omitted operations to Overview.md. To be debated in #3.
stoklund referenced this issue in WebAssembly/simd Apr 25, 2017
Fold the portable-simd.md specification and the WebAssembly mapping into
a single SIMD.md document that is specific to WebAssembly. This is
easier to read and avoids a lot of confusion.

- Don't attempt to specify floating-point semantics again. Just refer to
  existing WebAssembly behavior.
- Remove the minNum and maxNum operations which were never intended to
  be included in WebAssembly.
- Clarify the trapping behavior of the float-to-int conversions.
- Remove the descriptions of reciprocal [sqrt] approximations. See #3.
- Rename all operations to use snake_case.
- Add *.const instructions for all the new types.
- Remove the partial load and store operators as suggested by
  @sunfishcode. They were removed from asm.js too.
@lemaitre

As long as the constraint on the reproducibility of results across clients is not softened, this question has no answer: it is not possible to propose such a function that gives the exact same result on every architecture.
SSE, Neon, Altivec, AVX512: these ISAs all have different instructions giving different results.

But if this constraint disappears, then I think the best way would be to have something like this:

  • f32x4.reciprocalApproximation(a: v128, prec: int8) -> v128
  • f32x4.reciprocalSqrtApproximation(a: v128, prec: int8) -> v128
  • f64x2.reciprocalApproximation(a: v128, prec: int8) -> v128
  • f64x2.reciprocalSqrtApproximation(a: v128, prec: int8) -> v128
    will compute an approximation of f(a) with at least prec bits correct.
    This instruction would map to a fast hardware instruction and then recover accuracy with some Newton-Raphson iterations to achieve the requested accuracy (a rough lowering sketch follows below).
    This would improve the portability of such a function, but would still give "slightly" different results on different architectures.
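A rough sketch of what that lowering could look like on SSE (the function name, the ~12-bit baseline, and the iteration rule are illustrative assumptions, not part of any proposal):

```c
#include <xmmintrin.h>

/* Hypothetical lowering of f32x4.reciprocalApproximation(a, prec) on SSE:
 * start from the ~12-bit rcpps estimate and add Newton-Raphson steps, each
 * of which roughly doubles the number of correct bits. With prec as an
 * immediate, the loop would be fully unrolled at code-generation time. */
static __m128 f32x4_recip_approx(__m128 a, int prec) {
    __m128 e = _mm_rcp_ps(a);                         /* ~12 correct bits */
    for (int bits = 12; bits < prec; bits *= 2) {
        /* e' = e * (2 - a * e) */
        e = _mm_mul_ps(e, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(a, e)));
    }
    return e;
}
```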

@gnzlbg

gnzlbg commented May 24, 2018

Of the ISAs with a reciprocal square-root estimate, which one has the largest error?

Some (most?) ISAs also provide instructions that perform a single Newton-Raphson iteration for this operation. An alternative here would be to specify an upper bound on the error, use the ISA's reciprocal square-root instruction when it satisfies that bound, and add one or more NR iterations to the result on ISAs where it does not.

@lemaitre

lemaitre commented May 24, 2018

Of the ISAs with a reciprocal square-root estimate, which one has the largest error?

I would say Neon, but I don't remember exactly. Most documents I have found don't specify the precision of this instruction, and some give a maximum error of 1/4096 (12 bits), which would be the same as x86. In my experience, though, it's less accurate than Intel's: either ARM is less accurate than 12 bits, or Intel is more accurate than its documentation suggests...

Some (most?) ISAs also provide instructions that perform a single Newton-Raphson iteration for this operation

Only Neon supports such an instruction. With other ISAs, you need to implement the NR iteration yourself with a bunch of MUL/FMA.
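For concreteness, a minimal sketch of the Neon form (illustrative only; assumes AArch64 intrinsics):

```c
#include <arm_neon.h>

/* One refinement of a reciprocal square-root estimate using Neon's dedicated
 * step instruction: vrsqrtsq_f32(p, q) computes (3 - p*q) / 2, so
 * e' = e * (3 - a*e*e) / 2. On other ISAs the same update is a few MUL/FMA. */
static float32x4_t rsqrt_refine(float32x4_t a, float32x4_t e) {
    return vmulq_f32(e, vrsqrtsq_f32(vmulq_f32(a, e), e));
}
```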

An alternative here would be to specify an upper bound on the error, use the ISA's reciprocal square-root instruction when it satisfies that bound, and add one or more NR iterations to the result on ISAs where it does not.

I tend to think that's the way to do it.
Actually, with what I proposed, the accuracy requirement was not fixed but embedded within the instruction.

Some extra thoughts:
Reciprocal (square root) for double precision is not available on SSE and AVX (but is available on AArch64 Neon, VSX, and AVX512).
It is possible to do a fast, rough estimation in that case, though.

AVX512 now has more accurate instructions for these operations: 14 and 28 bits.

@gnzlbg

gnzlbg commented May 24, 2018

Do you know the name of the VSX instruction that supports 64-bit rsqrte on powerpc? The 32-bit instruction also has an error of 1/4096 like the SSE instructions.

MIPS MSA supports 32-bit and 64-bit reciprocal square-root estimate with an error of at most 2 ULPs.


Also worst case one can implement these by doing a full vector square root, which most ISAs support, followed by a divide. That's slow, but would still be a conforming implementation.
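As a concrete illustration of that fallback, a sketch with SSE intrinsics (illustrative only; a WebAssembly engine would use the corresponding sqrt and div instructions):

```c
#include <xmmintrin.h>

/* Conforming but slow fallback: a full-precision square root followed by a
 * full-precision divide. */
static __m128 rsqrt_fallback(__m128 x) {
    return _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x));
}
```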

@gnzlbg

gnzlbg commented May 24, 2018

It seems that PowerPC implements this for 64-bit floats with 14-bit precision. So 12-bit precision for 32-bit and 14-bit precision for 64-bit look like "reasonable" upper bounds on the error to me.

@lemaitre

lemaitre commented May 24, 2018

Do you know the name of the VSX instruction that supports 64-bit rsqrte on powerpc?

On 64-bit POWER (VSX), there is xvrsqrtesp for F32 and xvrsqrtedp for F64 (both are available through the C intrinsic vec_rsqrte).
I would assume they have the same precision, but I haven't put any effort into confirming that.

MIPS MSA supports 32-bit and 64-bit reciprocal square-root estimate with an error of at most 2 ULPs.

I've never used MIPS, so I'll take your word for it.

It seems that PowerPC implements this for 64-bit floats with 14-bit precision. So 12-bit precision for 32-bit and 14-bit precision for 64-bit look like "reasonable" upper bounds on the error to me.

I have to disagree with your conclusion: one way to implement the fast reciprocal for F64 when you don't have a dedicated instruction is to convert your input to F32, call the fast reciprocal on F32, and convert back to F64.
Doing that, you "inherit" the precision of F32. By setting the F64 accuracy requirement slightly higher than the F32 one, you clearly forbid such an implementation.
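A sketch of that round-trip for two doubles with SSE2 intrinsics (illustrative only; note the range caveat raised in the next comment):

```c
#include <emmintrin.h>

/* Reciprocal estimate for two doubles by round-tripping through f32:
 * it inherits the ~12-bit accuracy of rcpps and only covers the f32 range. */
static __m128d f64x2_recip_estimate(__m128d a) {
    __m128 lo = _mm_cvtpd_ps(a);         /* narrow both doubles to floats */
    return _mm_cvtps_pd(_mm_rcp_ps(lo)); /* estimate, then widen back */
}
```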

Actually, that's why I would recommend that the precision not be fixed: the instruction could take an immediate specifying the required accuracy, and when the assembly is generated, the fast reciprocal instruction would be used and enough NR iterations would be emitted to achieve the requested accuracy on the actual hardware.

So users would not write NR themselves; they would just specify that they want, say, a 20-bit-accurate result, and that's it.
However, I think the accuracy requirement should be an immediate and not a runtime argument, but that's up for debate.

Also worst case one can implement these by doing a full vector square root, which most ISAs support, followed by a divide. That's slow, but would still be a conforming implementation.

I totally agree with you on that point. By the way, if the target precision is high enough, that implementation can actually be the fastest one possible.

@gnzlbg

gnzlbg commented May 24, 2018

I have to disagree with your conclusion: one way to implement the fast reciprocal for F64 when you don't have a dedicated instruction is to convert your input to F32, call the fast reciprocal on F32, and convert back to F64.
Doing that, you "inherit" the precision of F32. By setting the F64 accuracy requirement slightly higher than the F32 one, you clearly forbid such an implementation.

While I think this is what some ISAs actually do (does PowerPC 64 do this?), the largest representable f32 is 3.4 x 10^38 while the largest representable f64 is 1.80 x 10^308. There is a whole range of f64s that just aren't representable as f32s. At that point, why expose the f64 version at all? I'd argue that it would be better to just not expose it, and let users do the truncation to f32 however they see fit (INFINITY, clamp, etc.).

Actually, that's why I would recommend that the precision not be fixed: the instruction could take an immediate specifying the required accuracy, and when the assembly is generated, the fast reciprocal instruction would be used and enough NR iterations would be emitted to achieve the requested accuracy on the actual hardware.

This is actually a really interesting idea.

However, I think the accuracy requirement should be an immediate and not a runtime argument,

I think making them an immediate is reasonable: those who want run-time behavior can just branch on it (e.g. have a switch that calls the intrinsic with different immediates); that's what the compiler would have to generate anyway if it accepted a run-time argument (although the compiler might be a bit more clever about the switch, since it may know which options are available on a particular target).
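A purely illustrative sketch of that user-side pattern (v128_t comes from clang's wasm_simd128.h; the recip_approx_* helpers are hypothetical stand-ins for the instruction instantiated with different immediate prec values):

```c
#include <wasm_simd128.h>   /* provides v128_t when targeting wasm with clang */

/* Hypothetical stand-ins for the proposed instruction with different
 * immediate `prec` values; not real intrinsics. */
v128_t recip_approx_12(v128_t a);
v128_t recip_approx_24(v128_t a);

/* Recovering "runtime precision" on top of an immediate-only instruction
 * by branching to different immediates. */
static v128_t recip_with_runtime_prec(v128_t a, int prec_bits) {
    return (prec_bits <= 12) ? recip_approx_12(a) : recip_approx_24(a);
}
```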

@lemaitre

While I think this is what some ISAs actually do (does PowerPC 64 do this?), the largest representable f32 is 3.4 x 10^38 while the largest representable f64 is 1.80 x 10^308. There is a whole range of f64s that just aren't representable as f32s. At that point, why expose the f64 version at all? I'd argue that it would be better to just not expose it, and let users do the truncation to f32 however they see fit.

I never said that was the only way to implement it. It's true that this specific implementation does not support inputs outside the range of F32.

Another way to implement it is with a bit hack: https://en.wikipedia.org/wiki/Fast_inverse_square_root
The article presents only the F32 bit hack (which is no longer worthwhile on its own), but it can also be done for F64, and it can still be worthwhile depending on the accuracy you need.
It requires more NR iterations, but nothing fancy.
This technique supports the whole range of F64 (except subnormals).
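For reference, a sketch of the F64 variant (the magic constant is one of the values cited in the literature around the linked article, assumed here rather than taken from this thread):

```c
#include <stdint.h>
#include <string.h>

/* F64 variant of the fast inverse square root bit hack; 0x5FE6EB50C7B537A9
 * is one commonly cited magic constant (an assumption, not from the thread).
 * Accuracy is then recovered with Newton-Raphson steps. */
static double rsqrt_f64_bithack(double x) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);               /* reinterpret the double as bits */
    u = 0x5FE6EB50C7B537A9ULL - (u >> 1);
    double e;
    memcpy(&e, &u, sizeof e);
    e = e * (1.5 - 0.5 * x * e * e);        /* NR step 1 */
    e = e * (1.5 - 0.5 * x * e * e);        /* NR step 2 */
    return e;
}
```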

I think making them an immediate is reasonable: those who want run-time behavior can just branch on it (e.g. have a switch that calls the intrinsic with different immediates); that's what the compiler would have to generate anyway if it accepted a run-time argument (although the compiler might be a bit more clever about the switch, since it may know which options are available on a particular target).

Agreed. I don't see any valid use case for a runtime precision requirement anyway.

penzn referenced this issue in penzn/simd Aug 20, 2019
Extended loads: 32->64 bit ops
ngzhian transferred this issue from WebAssembly/simd on Mar 17, 2021
@ngzhian
Member

ngzhian commented Sep 8, 2021

So users would not write NR themselves; they would just specify that they want, say, a 20-bit-accurate result, and that's it.
However, I think the accuracy requirement should be an immediate and not a runtime argument, but that's up for debate.

This is an interesting area to explore. Implementing the NR step seems simple enough on non-NEON backends. I would need some help figuring out how many steps of NR to perform to achieve the specified accuracy:

From https://www.felixcloutier.com/x86/rcpps, the relative error is |Relative Error| ≤ 1.5 * 2^−12, and from https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division: "the number of correct digits in the result roughly doubles for every iteration". So if the user wants 28 bits of accuracy for a float, how many rounds do I need?

(Sorry, I know nothing about numerical analysis; any pointers here would be helpful.)
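As a rough back-of-the-envelope under the doubling rule quoted above (not a substitute for a real error analysis): one Newton-Raphson step for the reciprocal squares the relative error, so starting from 1.5 * 2^−12 ≈ 2^−11.4, one step lands around 2^−22.8 (about 23 correct bits) and two steps around 2^−45.7. Two steps would therefore comfortably exceed 28 bits, although an f32 result is limited by its 24-bit significand anyway, so a 28-bit target only really makes sense for the f64 variants.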

@Maratyszcza
Collaborator

Here is an old unpublished paper of mine. While its main topic is not related to reciprocal (sqrt) approximation operations, it does contain probably the best description and semi-analytical model of the subject in Section 4.

@ngzhian
Member

ngzhian commented Mar 22, 2022

As discussed in https://github.com/WebAssembly/meetings/blob/main/simd/2022/SIMD-02-18.md#aob (search for "reciprocal"), this is likely out of scope, so I'm adding a label to indicate that.

ngzhian added the "out of scope" label on Mar 22, 2022
kangwoosukeq pushed a commit to prosyslab/v8 that referenced this issue Apr 28, 2022
These were originally proposed as a part of the fixed-width SIMD
proposal, and were then migrated to the relaxed-simd proposal
which also deems these operations out of scope.

Github issue: WebAssembly/relaxed-simd#4
