This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

f32x4.add_pairwise and f64x2.add_pairwise #375

Open · wants to merge 1 commit into base: main

Conversation

Maratyszcza (Contributor)

Introduction

Pairwise addition (AKA horizontal addition) is a SIMD operation that computes sums within adjacent pairs of lanes in the concatenation of two SIMD registers. Floating-point pairwise addition is supported on both x86 (since SSE3) and ARM (since NEON), and is a commonly used primitive for full and partial sums of SIMD vectors.

Besides pairwise addition, x86 supports pairwise subtraction. ARM doesn't support pairwise subtraction, but offers pairwise maximum and minimum operations. Pairwise subtraction, minimum, and maximum were left out of this PR as each of them would benefit only a single architecture.
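
For reference, here is a minimal scalar sketch of the intended semantics, following the lane order of x86 HADDPS/HADDPD and ARM64 FADDP (the C function names are illustrative, not part of the proposal):

```c
/* Scalar reference for the proposed instructions: the low half of the result
 * holds the pairwise sums of a, the high half the pairwise sums of b. */
void f32x4_add_pairwise(float y[4], const float a[4], const float b[4]) {
    y[0] = a[0] + a[1];
    y[1] = a[2] + a[3];
    y[2] = b[0] + b[1];
    y[3] = b[2] + b[3];
}

void f64x2_add_pairwise(double y[2], const double a[2], const double b[2]) {
    y[0] = a[0] + a[1];
    y[1] = b[0] + b[1];
}
```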

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations are not required to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • f32x4.add_pairwise
    • y = f32x4.add_pairwise(a, b) is lowered to VHADDPS xmm_y, xmm_a, xmm_b
  • f64x2.add_pairwise
    • y = f64x2.add_pairwise(a, b) is lowered to VHADDPD xmm_y, xmm_a, xmm_b

x86/x86-64 processors with SSE3 instruction set

  • f32x4.add_pairwise
    • y = f32x4.add_pairwise(a, b) (y is NOT b) is lowered to MOVAPS xmm_y, xmm_a + HADDPS xmm_y, xmm_b
  • f64x2.add_pairwise
    • y = f64x2.add_pairwise(a, b) (y is NOT b) is lowered to MOVAPS xmm_y, xmm_a + HADDPD xmm_y, xmm_b
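
For illustration, a rough SSE3 intrinsics equivalent of the lowering above (engines would emit the instructions directly; the wrapper functions are hypothetical):

```c
#include <pmmintrin.h>

__m128 f32x4_add_pairwise_sse3(__m128 a, __m128 b) {
    return _mm_hadd_ps(a, b);   /* HADDPS */
}

__m128d f64x2_add_pairwise_sse3(__m128d a, __m128d b) {
    return _mm_hadd_pd(a, b);   /* HADDPD */
}
```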

x86/x86-64 processors with SSE2 instruction set

  • f32x4.add_pairwise
    • y = f32x4.add_pairwise(a, b) (y is NOT b) is lowered to MOVAPS xmm_y, xmm_a + MOVAPS xmm_tmp, xmm_a + SHUFPS xmm_y, xmm_b, 0x88 + SHUFPS xmm_tmp, xmm_b, 0xDD + ADDPS xmm_y, xmm_tmp
  • f64x2.add_pairwise
    • y = f64x2.add_pairwise(a, b) (y is NOT b) is lowered to MOVAPS xmm_tmp, xmm_b + MOVHLPS xmm_tmp, xmm_a + MOVSD xmm_y, xmm_a + MOVLHPS xmm_y, xmm_b + ADDPD xmm_y, xmm_tmp
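
For illustration, a rough SSE2 intrinsics sketch of the shuffle-based lowering above; the f64x2 variant uses UNPCKLPD/UNPCKHPD rather than the MOVHLPS/MOVLHPS sequence but computes the same result (wrapper names are hypothetical):

```c
#include <emmintrin.h>

__m128 f32x4_add_pairwise_sse2(__m128 a, __m128 b) {
    __m128 even = _mm_shuffle_ps(a, b, 0x88); /* a0, a2, b0, b2 */
    __m128 odd  = _mm_shuffle_ps(a, b, 0xDD); /* a1, a3, b1, b3 */
    return _mm_add_ps(even, odd);             /* a0+a1, a2+a3, b0+b1, b2+b3 */
}

__m128d f64x2_add_pairwise_sse2(__m128d a, __m128d b) {
    __m128d lo = _mm_unpacklo_pd(a, b);       /* a0, b0 */
    __m128d hi = _mm_unpackhi_pd(a, b);       /* a1, b1 */
    return _mm_add_pd(lo, hi);                /* a0+a1, b0+b1 */
}
```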

ARM64 processors

  • f32x4.add_pairwise
    • y = f32x4.add_pairwise(a, b) is lowered to FADDP Vy.4S, Va.4S, Vb.4S
  • f64x2.add_pairwise
    • y = f64x2.add_pairwise(a, b) is lowered to FADDP Vy.2D, Va.2D, Vb.2D

ARMv7 processors with NEON instruction set

  • f32x4.add_pairwise
    • y = f32x4.add_pairwise(a, b) is lowered to VPADD.F32 Dy_lo, Da_lo, Da_hi + VPADD.F32 Dy_hi, Db_lo, Db_hi
  • f64x2.add_pairwise
    • y = f64x2.add_pairwise(a, b) is lowered to VADD.F64 Dy_lo, Da_lo, Da_hi + VADD.F64 Dy_hi, Db_lo, Db_hi
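
For illustration, rough NEON intrinsics equivalents of the ARM lowerings above (wrapper names are hypothetical; ARMv7 NEON has no 128-bit f64 vector type, so f64x2 falls back to the two scalar VADD.F64 operations listed above):

```c
#include <arm_neon.h>

#if defined(__aarch64__)
/* AArch64: FADDP on full 128-bit vectors. */
float32x4_t f32x4_add_pairwise_neon(float32x4_t a, float32x4_t b) {
    return vpaddq_f32(a, b);
}
float64x2_t f64x2_add_pairwise_neon(float64x2_t a, float64x2_t b) {
    return vpaddq_f64(a, b);
}
#else
/* ARMv7 NEON: pairwise-add the 64-bit halves of each input. */
float32x4_t f32x4_add_pairwise_neon(float32x4_t a, float32x4_t b) {
    float32x2_t lo = vpadd_f32(vget_low_f32(a), vget_high_f32(a)); /* a0+a1, a2+a3 */
    float32x2_t hi = vpadd_f32(vget_low_f32(b), vget_high_f32(b)); /* b0+b1, b2+b3 */
    return vcombine_f32(lo, hi);
}
#endif
```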

@omnisip commented Oct 7, 2020

@Maratyszcza can I create a pull request against your pull request so I can add in the integer functionality to complete the operation set and join these two together?

@Maratyszcza (Contributor, Author)

@omnisip I suggest you create a new PR with just the integer instructions.

@omnisip commented Oct 7, 2020

> @omnisip I suggest you create a new PR with just the integer instructions.

This is a little different from the load zero proposal because you're proposing instructions that can and should be implemented uniformly on every type, which seems like something there should be precedent for.

@tlively I have no objection to creating a separate proposal for the integer ops, but if I do, I want to know whether I should include @Maratyszcza's floating-point work in it as well.

@tlively (Member) commented Oct 8, 2020

@omnisip I think a separate proposal is fine. There's no downside to having them in separate (cross-linked) PRs, and that keeps things simpler because the separate PRs will have separate "champions" rather than two folks pushing for different content in one PR. Including the fp work in the new PR is probably unnecessary, but this PR should certainly be mentioned in the description of the new one.

@lemaitre commented Oct 9, 2020

I would like to remind everyone that on x86, haddps-based reduction is slower than hierarchical shuffle-based reductions.
The throughput is halved (haddps requires 2 shuffles internally) and the latency is 50% higher.
All the examples given could be rewritten more efficiently with plain shuffles.

So here is my question: apart from horizontal sum reduction, is there any code that could benefit from such a pairwise add?

Because if that's not the case, we would prefer to either introduce complete reductions that would not use haddps, or write plain shuffles in WASM and improve the code generation of shuffles in existing engines (a shuffle on 32-bit elements should not produce a pshufb).
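
To make the comparison concrete, here is a sketch (hypothetical helper names, SSE3) of the two ways to sum the four lanes of a single vector being contrasted here:

```c
#include <pmmintrin.h>

/* haddps-based: two HADDPS in a dependency chain. */
float hsum_hadd(__m128 v) {
    v = _mm_hadd_ps(v, v);              /* v0+v1, v2+v3, v0+v1, v2+v3 */
    v = _mm_hadd_ps(v, v);              /* full sum in every lane */
    return _mm_cvtss_f32(v);
}

/* Hierarchical shuffle + add reduction. */
float hsum_shuffle(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);   /* v1, v1, v3, v3 */
    __m128 sums = _mm_add_ps(v, shuf);  /* v0+v1, _, v2+v3, _ */
    shuf = _mm_movehl_ps(shuf, sums);   /* v2+v3 in lane 0 */
    sums = _mm_add_ss(sums, shuf);      /* (v0+v1) + (v2+v3) */
    return _mm_cvtss_f32(sums);
}
```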

@Maratyszcza (Contributor, Author)

@lemaitre Not all x86 architectures decompose horizontal instructions into multiple uops, e.g. AMD Jaguar decodes HADDPS into one uop. However, if we reduce multiple SIMD vectors jointly (example), even with decomposition into 3 uops horizontal instructions are no worse than explicit decomposition into 2x shuffle + add, and more efficient in practice due to elimination of MOVAPS instructions and clobbered temporary registers.
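
The linked example is not reproduced here, but a hypothetical sketch of the kind of joint reduction being described (reducing four vectors to a single vector of their four sums with three HADDPS) looks like this:

```c
#include <pmmintrin.h>

__m128 sum4(__m128 a, __m128 b, __m128 c, __m128 d) {
    __m128 ab = _mm_hadd_ps(a, b);      /* a0+a1, a2+a3, b0+b1, b2+b3 */
    __m128 cd = _mm_hadd_ps(c, d);      /* c0+c1, c2+c3, d0+d1, d2+d3 */
    return _mm_hadd_ps(ab, cd);         /* sum(a), sum(b), sum(c), sum(d) */
}
```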

@lemaitre commented Oct 9, 2020

> @lemaitre Not all x86 architectures decompose horizontal instructions into multiple uops, e.g. AMD Jaguar decodes HADDPS into one uop. However, if we reduce multiple SIMD vectors jointly (example), even with decomposition into 3 uops horizontal instructions are no worse than explicit decomposition into 2x shuffle + add, and more efficient in practice due to elimination of MOVAPS instructions and clobbered temporary registers.

Even if it is a single uop on Jaguar (which needs to be confirmed), Agner Fog's tables show a reciprocal throughput of 1 for haddps, while a single step of the full reduction (add + shuffle) is also 1.
So in this case, the full reduction will be as fast with both implementations.
But things have changed on Zen, and now the picture is basically the same as on Intel.
And please remember that a majority of desktops are Intel, and that you can do at most one shuffle per cycle on this architecture.

Now, the example with 4 reductions inside a single vector is actually an interesting use case I never considered.
But even then, you can implement it using only 6 shuffles (+3 blendps), which makes it as efficient as the haddps-based code (except for code compactness, of course).
In this case, for native code, I would most likely use haddps.
But to the question "Should we put haddps into WASM?", I would say it is not worth it, as the alternative is as fast (and maybe faster on the Zen architecture).

The main problem I see is that it will incentivize people to use this pairwise add even when faster alternatives are possible.
Hence my question about other use cases.

@omnisip commented Oct 9, 2020

@lemaitre Part of this exercise is one in futility. At this point in time, Nehalem, the first Intel architecture with two shuffle ports and support for every instruction set except AVX*, is over 11 years old. It really makes me wonder who we're trying to support with SSE2. Are they going to get bleeding-edge Node.js or Chrome, load WebAssembly, and use WebAssembly SIMD on something that probably doesn't support more than 8 GB of memory, tops?

In general, wasm should be forward looking because everything we dream up today will be in silicon in the next 20 years. The likelihood of hadd instructions (which do perform better in at least some cases) being the overall bottleneck is somewhat ridiculous. At the end of the day, it is application-targeted usage that will cause instructions to get performance improvements in future architectures.

(Edits made to clarify and increase accuracy)

@omnisip commented Oct 9, 2020

On a related topic, we're not limited to using the hardware's version of hadd if we can propose better alternatives under the hood. Remember that WebAssembly on some level is a virtual machine.

@lemaitre commented Oct 9, 2020

> @lemaitre Part of this exercise is one in futility. At this point in time Nehalem, the first Intel architecture with two shuffle ports and every instruction set except AVX* is over 10 years old. It really makes me wonder who we're trying to support with sse2.

I don't get your point. Current Intel architectures have a single shuffle port (port 5, if you're interested), and it is a real bottleneck on multiple algorithms that require shuffling. In fact, it seems that Nehalem is the only Intel architecture with 2 shuffle ports. To make things even more stupid, element-wise bit shifts use the shuffle port.

I have no problem with optimizing WASM for AVX2 (for instance), and this is definitely not the problem I have with haddps.
It is that most (all?) of its uses actually slow down applications.
Standardizing such an instruction will make people use it and, at the end of the day, leave them with slower code than without it.

> In general, wasm should be forward looking because everything we dream up today will be in silicon in the next 20 years.

I completely agree with this. Actually, experience shows that nothing has been done to accelerate haddps, and AMD even chose to cripple it further on Zen with a reciprocal throughput of 2 cycles, whereas a reduction step is ~0.75 cycle (P2 and P3 for ADD, P1 and P2 for shuffle).
I see no reason to think this trend will change in the future.

> The likelihood of hadd instructions (which do perform better in at least some cases) being the overall bottleneck is somewhat ridiculous.

If an application can benefit from horizontal add (in whatever form), it will benefit from not using haddps.
In fact, the cases where haddps is beneficial seem very rare (if any), and limited to older architectures (Jaguar?).

> On a related topic, we're not limited to using the hardware's version of hadd if we can propose better alternatives under the hood. Remember that WebAssembly on some level is a virtual machine.

I completely agree with this, and a fully fledged horizontal add has already been proposed (or at least discussed), but nobody thought it was useful enough.

For me, it is far more beneficial to have low-overhead shuffles than a horizontal add. And apparently, WASM engines are not up to the task and don't even try to generate shuffles with immediates instead of pshufb.

@omnisip commented Oct 9, 2020

> @lemaitre Part of this exercise is one in futility. At this point in time Nehalem, the first Intel architecture with two shuffle ports and every instruction set except AVX* is over 10 years old. It really makes me wonder who we're trying to support with sse2.

> I don't get your point. Current Intel architectures have a single shuffle port (port 5, if you're interested), and it is a real bottleneck on multiple algorithms that require shuffling. In fact, it seems that Nehalem is the only Intel architecture with 2 shuffle ports. To make things even more stupid, element-wise bit shifts use the shuffle port.

We're both right and wrong on this one. Check out this table from UOPS. It looks like Sandy Bridge and Ice Lake have 2 shuffle ports, but every generation in between has only 1. Zen+ and Zen2 appear to have 2.

I'm aware of the shuffle bottlenecks, fortunately and unfortunately, so the comment isn't lost on me. I frequently look at these charts to see whether something might be implemented like a 'shuffle', including shift instructions and alignr.

> I have no problem with optimizing WASM for AVX2 (for instance), and this is definitely not the problem I have with haddps.
> It is that most (all?) of its uses actually slow down applications.
> Standardizing such an instruction will make people use it and, at the end of the day, leave them with slower code than without it.

This will be a topic detailed later today when I put forth my integer hadd proposal, but my interest in horizontal add is because of prefix sum (scan) algorithms and colorspace conversion. Right now, I'm getting bottlenecked by shuffles (or shuffle equivalents) on each.
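
For context, an in-register f32x4 prefix sum of the kind being referred to can be sketched with byte shifts and adds (SSE2; illustrative only, not taken from any proposal):

```c
#include <emmintrin.h>

__m128 prefix_sum_f32x4(__m128 x) {
    /* x = x0, x1, x2, x3 */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    /* x0, x0+x1, x1+x2, x2+x3 */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    /* x0, x0+x1, x0+x1+x2, x0+x1+x2+x3 */
    return x;
}
```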

> In general, wasm should be forward looking because everything we dream up today will be in silicon in the next 20 years.

> I completely agree with this. [clipped]

Hooray! I'm not the only one. :-)

> On a related topic, we're not limited to using the hardware's version of hadd if we can propose better alternatives under the hood. Remember that WebAssembly on some level is a virtual machine.

> I completely agree with this, and a fully fledged horizontal add has already been proposed (or at least discussed), but nobody thought it was useful enough.

I'm putting forward a proposal for integer horizontal add and horizontal multiply and add later today. Hopefully it'll be feature complete. If you have suggestions on how to improve it or the asm (hadd instruction use or not), please let me know!

> For me, it is far more beneficial to have low-overhead shuffles than a horizontal add. And apparently, WASM engines are not up to the task and don't even try to generate shuffles with immediates instead of pshufb.

It's not that they aren't trying. We can take this discussion offline if you'd like. It doesn't appear to be a people problem, nor is it a case of someone not caring; it's quite the opposite. There are some current implementation challenges that are preventing us from getting the best performance. For instance, v128.const isn't well supported in V8 and is still an active issue, so every optimization that could take advantage of a v128 constant from aligned memory is not yet functional.

@lemaitre commented Oct 9, 2020

> We're both right and wrong on this one. Check out this table from UOPS. It looks like Sandy Bridge and Ice Lake have 2 shuffle ports, but every generation in between has only 1. Zen+ and Zen2 appear to have 2.

Looking at both Agner Fog's tables and your site (which is nice, btw), Sandy Bridge has a single shuffle port (reciprocal throughput of 1 for shufps), but indeed Ice Lake appears to have 2 ports, at last!
And haddps remains at 2 cycles for Ice Lake, so the situation on Ice Lake is the same as on Zen (even the sharing of ports).

All this means that shuffles will be less of a bottleneck in the future (finally...), but also that haddps is even less interesting, because it does not get faster (i.e., higher throughput) and remains at 2 cycles while shuffle speed is doubled (a bit less in practice because of the port sharing).

> This will be a topic detailed later today when I put forth my integer hadd proposal, but my interest in horizontal add is because of prefix sum (scan) algorithms and colorspace conversion. Right now, I'm getting bottlenecked by shuffles (or shuffle equivalents) on each.

I don't know much about colorspace conversion, but prefix sum in SIMD is close to reduction, so the same tricks would most likely apply.

> I'm putting forward a proposal for integer horizontal add and horizontal multiply and add later today.

If by "horizontal multiply and add" you mean dot product, I think there is a proposal in flight, or at least an issue.
You can already have a look at #20 for the reduce-add discussion.

@omnisip commented Oct 9, 2020

> We're both right and wrong on this one. Check out this table from UOPS. It looks like Sandy Bridge and Ice Lake have 2 shuffle ports, but every generation in between has only 1. Zen+ and Zen2 appear to have 2.

> Looking at both Agner Fog's tables and your site (which is nice, btw), Sandy Bridge has a single shuffle port (reciprocal throughput of 1 for shufps), but indeed Ice Lake appears to have 2 ports, at last!

Sandy has two, but only one is made available to shufps. Both are made available to pshufd, which has a reciprocal throughput of 0.5, but I rarely do floating-point calculations. Is there a performance penalty for using pshufd or palignr over shufps?

@lemaitre commented Oct 9, 2020

> Sandy has two, but only one is made available to shufps. Both are made available to pshufd, which has a reciprocal throughput of 0.5, but I rarely do floating-point calculations. Is there a performance penalty for using pshufd or palignr over shufps?

Wow, Intel always finds ways to astonish me with random design choices, even after all those years!
To answer your question: pshufd has no performance penalty per se, but it shuffles a single vector while shufps "half-shuffles" 2 vectors.
However, on old-ish architectures, using integer instructions on float data and vice versa does produce a penalty, but this greatly varies from one architecture to another.
I would need to look back at some documentation to tell whether that's still the case on the latest platforms.

@omnisip commented Oct 9, 2020

> Wow, Intel always finds ways to astonish me with random design choices, even after all those years!
> To answer your question: pshufd has no performance penalty per se, but it shuffles a single vector while shufps "half-shuffles" 2 vectors.

Aye. With the penalty for unaligned reads becoming negligible, does it make sense at some point to simply do more than one unaligned load rather than trying to fuss with the middle elements? It's almost certain to be a cache hit if the first load was done on a 64-byte-aligned boundary.
