-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider FMA #5
Comments
It would help performance to just have a multiply-add instruction, which could be optionally implemented with FMA |
I think 3 operations are actually needed:
The first operation is actually not needed at all: this can be easily detect by the VM as a sequence of a mul and add. To be noted that some hardware have special instructions for this very case. I think here it would be really important to detect the presence of hadware instruction for FMA. |
Performing the operation with intermediate rounding can already be expressed using What most apps want is "perform |
Here my example: without FMA error-free Dekker’s multiplication algorithm requires 17 floating-point operations. But if it will be available in WASM, it will be required only 2 FLOP. [1] This algorithm probably have a biggest performance boost because it requires precision. Theoretically FMA can gives 17/2 * 100% - 100% = +750% to this algo, but here you can find real benchmark. You can find another applications and papers where it is usefull. References:
|
WebAssembly/simd#79 was an initial proposal for qfma, linking to here here since we can't transfer PRs. |
There are a whole bunch of fundamental computational building blocks where people care about the precise bits of a floating point number, and careful error analysis has been done to make guarantees about maximum errors, where using FMA greatly simplifies and speeds up the algorithm but naively performing "a + b * c" with intermediate rounding totally wrecks the result and a more complicated algorithm must be used instead if the mul and add operations are separate. These are especially common in computational geometry where slight rounding errors can break invariants that cause higher-level algorithms to give hilariously incorrect results, go into an infinite loop, or crash. For context see https://www.cs.cmu.edu/~quake/robust.html or for a list of relevant papers, try https://scholar.google.com/scholar?q=geometric+predicate+FMA – as munrocket points out correct double-double arithmetic is another (closely related) example. Likewise, substituting FMA for a + b * c will often wreck computations which expect (rely on) intermediate rounding. Opting people into an FMA they didn’t explicitly ask for will break their code. FMA is widely supported by hardware. It would be great to have some syntax/function for an explicit FMA in wasm (and every other programming language). In contexts where it is unavailable the code author can write a separate fall-back algorithm. |
Fused multiply-add is being discussed in WebAssembly/simd#8. Let's move the discussion here so we don't drop it.
I'd consider FMA after the first shot at SIMD because, similar to #4, it doesn't have universal platform support. Specific benchmarks may benefit from its presence. As with other ops, maybe we should consider adding support for scalar and vector FMA at the same time.
FMA is part of IEEE 754-2008, but isn't universally supported and can't be polyfilled efficiently. Maybe implementations should simply not advertise FMA if the hardware doesn't support it, forcing users to feature-detect, as opposed to offering non-fused fallback which yield different numerical results.
FMA offers multiply+add with one less rounding, sometimes better runtime performance (sometimes because the hardware itself is magical, it avoids register issues, has its own execution unit, etc), and may have better instruction-cache impact.
It's supported in languages like C or C++ through opt-in with pragma
fp_contract
, explicit math.hfma
function and friends, or fast math flags. I would disallow such optimizations by WebAssembly compilers, only the developer-side compiler should be allowed to emit that opcode.It's supported in recent x86, and all ARMv8.
The text was updated successfully, but these errors were encountered: