f32x4.add_pairwise and f64x2.add_pairwise #375
Conversation
@Maratyszcza can I create a pull request against your pull request, so I can add the integer functionality to complete the operation set and join these two together?
@omnisip I suggest you create a new PR with just the integer instructions.
This is a little different from the load zero proposal because you're proposing adding instructions that can and should be implemented uniformly on every type -- which seems like there should be precedent for. @tlively I have no objection to creating a separate proposal for the integer ops, but want to know, if I do, whether I should include @Maratyszcza's floating-point work in it as well.
@omnisip I think a separate proposal is fine. There's no downside to having them in separate (cross-linked) PRs, and that keeps it simpler because the separate PRs will have separate "champions" rather than having two folks pushing for different contents in one PR. Including the fp work in the new PR is probably unnecessary, but this PR should certainly be mentioned in the description of the new PR.
I would like to remind everyone that on x86, `hadd` instructions are decomposed into 2 shuffles + 1 add. So here is my question: apart from horizontal sum reduction, is there any code that could benefit from such a pairwise add? Because if that's not the case, we would prefer to introduce complete reductions that would not use `hadd`.
@lemaitre Not all x86 architectures decompose horizontal instructions into multiple µops, e.g. AMD Jaguar decodes `HADDPS` into a single µop.
Even if it is a single µop on Jaguar (which needs to be confirmed), Agner Fog's tables show a reciprocal throughput of 1 for `HADDPS`. Now, the example with 4 reductions inside a single vector is actually an interesting use-case I never considered. The main problem I see is that it will incentivize people to use this pairwise add even when faster alternatives are possible.
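For reference, the four-reductions-in-one-vector pattern mentioned above looks roughly like this with SSE3 intrinsics (a sketch; `hsum4` is a hypothetical helper name):

```c
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

// Reduce four f32x4 vectors to their four horizontal sums in one result:
// returns { sum(a), sum(b), sum(c), sum(d) }
static __m128 hsum4(__m128 a, __m128 b, __m128 c, __m128 d) {
  __m128 ab = _mm_hadd_ps(a, b);  // { a0+a1, a2+a3, b0+b1, b2+b3 }
  __m128 cd = _mm_hadd_ps(c, d);  // { c0+c1, c2+c3, d0+d1, d2+d3 }
  return _mm_hadd_ps(ab, cd);     // { sum(a), sum(b), sum(c), sum(d) }
}
```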
@lemaitre Part of this exercise is one in futility. At this point in time, Nehalem, the first Intel architecture with two shuffle ports and support for every instruction set except AVX*, is over 11 years old. It really makes me wonder who we're trying to support with SSE2. Are they going to get bleeding-edge Node.js or Chrome, load WebAssembly, and use WebAssembly SIMD on something that probably doesn't support more than 8 GB of memory, tops? In general, wasm should be forward-looking, because everything we dream up today will be in silicon in the next 20 years. The likelihood of hadd instructions (which do perform better in at least some cases) being the overall bottleneck is somewhat ridiculous. At the end of the day, it will be application-targeted usage that causes instructions to get performance improvements in future architectures. (Edits made to clarify and increase accuracy)
On a related topic, we're not limited to using the hardware's version of hadd if we can propose better alternatives under the hood. Remember that WebAssembly on some level is a virtual machine.
I don't get your point. Current Intel architectures have a single shuffle port (port 5, if you're interested), and it is a real bottleneck on multiple algorithms that require shuffling. In fact, it seems that Nehalem is the only Intel architecture with 2 shuffle ports. To make things even more stupid, element-wise bit shifts use the shuffle port. I have no problem with optimizing WASM for AVX2 (for instance), and this is definitely not the problem I have with `hadd`.
I completely agree with this. Actually, experience shows that nothing has been done to accelerate `hadd` on x86 since its introduction.
If an application can benefit from horizontal add (in whatever form), it will benefit from not using `hadd`.
I completely agree with this, and a fully fledged horizontal add has already been proposed (or at least discussed), but nobody thought it was useful enough. For me, it is far more beneficial to have low-overhead shuffles than to have horizontal add. And apparently, WASM engines are not up to the task and don't even try to generate shuffles with an immediate instead of a generic `pshufb`.
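For context, the shuffle-based alternative to `hadd` alluded to here is the classic reduction pattern below (a sketch with SSE intrinsics; `hsum` is a hypothetical name):

```c
#include <xmmintrin.h>  // SSE

// Horizontal sum of one f32x4 without HADDPS: two shuffle+add steps.
static float hsum(__m128 v) {
  __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // swap within pairs
  __m128 sums = _mm_add_ps(v, shuf);   // { v0+v1, v0+v1, v2+v3, v2+v3 }
  shuf = _mm_movehl_ps(shuf, sums);    // bring v2+v3 down to lane 0
  sums = _mm_add_ss(sums, shuf);       // lane 0 = v0+v1+v2+v3
  return _mm_cvtss_f32(sums);
}
```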
We're both right and wrong on this one. Check out the table on uops.info. It looks like Sandy Bridge and Ice Lake have 2 shuffle ports, but every generation in between has only 1. Zen+ and Zen 2 appear to have 2. I'm aware of the shuffle bottlenecks, fortunately and unfortunately, so the comment isn't lost on me. I frequently look at these charts to see if something might be implemented like a 'shuffle', including instructions such as shifts and alignr.
This will be a topic detailed later today when I put forth my integer hadd proposal, but my interest in horizontal add comes from prefix sum (scan) algorithms and colorspace conversion. Right now, I'm getting bottlenecked by shuffles (or shuffle equivalents) in both.
Hooray! I'm not the only one. :-)
I'm putting forward a proposal for integer horizontal add and horizontal multiply and add later today. Hopefully it'll be feature complete. If you have suggestions on how to improve it or the asm (hadd instruction use or not), please let me know!
We can take this discussion offline if you'd like. It doesn't appear to be a people problem, nor is it something where someone doesn't care. Quite the opposite: there are some current implementation challenges that are preventing us from getting the best performance. For instance, v128.const isn't well supported in V8 and is still an active issue. Every optimization that could take advantage of a v128 constant loaded from aligned memory is not yet functional.
Looking at both Agner Fog's tables and your site (which is nice, btw), Sandy Bridge has a single shuffle port (reciprocal throughput of 1 for `SHUFPS`). All this means that shuffles will be less of a bottleneck in the future (finally...), but also that emulating `hadd` with explicit shuffles will be cheaper.
I don't know much about colorspace conversion, but prefix sum in SIMD is close to reduction, so the same tricks would most likely apply.
If by "horizontal multiply and add", you mean dot product, I think there is a proposal on flight, or at least an issue. |
Sandy has two, but only one is made available to `shufps`. Both are available to `pshufd`, which has a reciprocal throughput of 0.5, but I rarely do floating-point calculations. Is there a performance penalty for using `pshufd` or `palignr` over `shufps`?
Wow, Intel always finds ways to astonish me with random design choices, even after all these years!
Aye. With the penalty for unaligned reads becoming negligible, does it make sense at some point to simply do more than one unaligned load, rather than trying to fuss with the middle elements? It's almost certain to be a cache hit if the first load was done on a 64-byte-aligned boundary.
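A minimal sketch of that idea, assuming a hypothetical float buffer `p` with enough readable elements:

```c
#include <xmmintrin.h>  // SSE

// Instead of stitching the "middle" elements of two adjacent vectors with
// shuffles (e.g. palignr), issue a second unaligned load at the overlapping
// offset. Assumes p has at least 5 readable floats.
static void two_windows(const float *p, __m128 *lo, __m128 *mid) {
  *lo  = _mm_loadu_ps(p);      // elements 0..3
  *mid = _mm_loadu_ps(p + 1);  // elements 1..4, no shuffle required
}
```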
Introduction
Pairwise addition (AKA horizontal addition) is a SIMD operation that computes sums within adjacent pairs of lanes in the concatenation of two SIMD registers. Floating-point pairwise addition is supported on both x86 (since SSE3) and ARM (since NEON) and is a commonly used primitive for full and partial sums on SIMD vectors.
Besides pairwise addition, x86 supports pairwise subtraction. ARM doesn't support pairwise subtraction, but offers pairwise maximum and minimum operations. Pairwise subtraction, minimum, and maximum were left out of this PR, as each of them would benefit only a single architecture.
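As a reference for the semantics described above, a scalar C sketch (lane order matches the SSE3/AVX lowering below; the helper names are hypothetical):

```c
// y = f32x4.add_pairwise(a, b): adjacent pairs of a, then of b.
void f32x4_add_pairwise(float y[4], const float a[4], const float b[4]) {
  y[0] = a[0] + a[1];
  y[1] = a[2] + a[3];
  y[2] = b[0] + b[1];
  y[3] = b[2] + b[3];
}

// y = f64x2.add_pairwise(a, b)
void f64x2_add_pairwise(double y[2], const double a[2], const double b[2]) {
  y[0] = a[0] + a[1];
  y[1] = b[0] + b[1];
}
```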
Applications
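One example, grounded in the full and partial sums mentioned in the Introduction: a full horizontal sum of an f32x4 vector takes two pairwise additions (illustrated with the scalar reference helpers above; `f32x4_sum` is a hypothetical name):

```c
// Full horizontal sum of one f32x4 via two pairwise additions.
float f32x4_sum(const float v[4]) {
  float t[4], r[4];
  f32x4_add_pairwise(t, v, v);  // { v0+v1, v2+v3, v0+v1, v2+v3 }
  f32x4_add_pairwise(r, t, t);  // lane 0 = v0+v1+v2+v3
  return r[0];
}
```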
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX instruction set

- `y = f32x4.add_pairwise(a, b)` is lowered to `VHADDPS xmm_y, xmm_a, xmm_b`
- `y = f64x2.add_pairwise(a, b)` is lowered to `VHADDPD xmm_y, xmm_a, xmm_b`
x86/x86-64 processors with SSE3 instruction set

- `y = f32x4.add_pairwise(a, b)` (`y` is NOT `b`) is lowered to `MOVAPS xmm_y, xmm_a` + `HADDPS xmm_y, xmm_b`
- `y = f64x2.add_pairwise(a, b)` (`y` is NOT `b`) is lowered to `MOVAPS xmm_y, xmm_a` + `HADDPD xmm_y, xmm_b`
x86/x86-64 processors with SSE2 instruction set

- `y = f32x4.add_pairwise(a, b)` (`y` is NOT `b`) is lowered to `MOVAPS xmm_y, xmm_a` + `MOVAPS xmm_tmp, xmm_a` + `SHUFPS xmm_y, xmm_b, 0x88` + `SHUFPS xmm_tmp, xmm_b, 0xDD` + `ADDPS xmm_y, xmm_tmp`
- `y = f64x2.add_pairwise(a, b)` (`y` is NOT `b`) is lowered to `MOVAPS xmm_tmp, xmm_b` + `MOVHLPS xmm_tmp, xmm_a` + `MOVSD xmm_y, xmm_a` + `MOVLHPS xmm_y, xmm_b` + `ADDPD xmm_y, xmm_tmp`

ARM64 processors
- `y = f32x4.add_pairwise(a, b)` is lowered to `FADDP Vy.4S, Va.4S, Vb.4S`
- `y = f64x2.add_pairwise(a, b)` is lowered to `FADDP Vy.2D, Va.2D, Vb.2D`
ARMv7 processors with NEON instruction set

- `y = f32x4.add_pairwise(a, b)` is lowered to `VPADD.F32 Dy_lo, Da_lo, Da_hi` + `VPADD.F32 Dy_hi, Db_lo, Db_hi`
- `y = f64x2.add_pairwise(a, b)` is lowered to `VADD.F64 Dy_lo, Da_lo, Da_hi` + `VADD.F64 Dy_hi, Db_lo, Db_hi`