Alternative to Swizzle / Shuffle #8
The discussion of these operations predates my involvement with the SIMD.js project. Perhaps @sunfishcode or @PeterJensen can remember the rationale. We have implemented these operations in Firefox with an SSE2 baseline. We also have some experience with this representation of shuffles from LLVM IR, whose shufflevector instruction does more or less the same thing as this proposal: C code using Intel intrinsics is translated into abstract shuffles. I think the rationale behind the abstract shuffle representation is that you can start by implementing the catch-all fallback and then add fast sequences for specific patterns. I think it would be a good idea to develop a list of patterns that should be fast in a good implementation, and we shouldn't expect implementations to provide optimal instruction sequences for all patterns. That would indeed be a big burden. |
Here's a bit of history: we used to have only one shuffle operation, which worked on 32x4 vectors. We provided the shuffle indices as a mask, with constants for each of the 256 possible combinations (SIMD.XXXX, SIMD.XXXY, SIMD.XXXZ, etc). This approach wasn't scalable when the 16x8 and 8x16 types were introduced. Also, in order to take advantage of two-input shuffle instructions (e.g. shufps), we renamed the original shuffle to swizzle, and shuffle was changed to take two inputs. There are a couple of discussion threads about this here: tc39/ecmascript_simd#74

Because of the explosion of possible shuffle masks with the 16x8 and 8x16 types, and the desire to still have readable code, we went with the 'list-of-indices' parameter approach instead. IMO, this is a pretty nice low-level 'abstraction'. It covers a lot (all?) of the different 'shuffle-like' instructions in both SSE and NEON. I think it's preferable to keep this abstraction rather than expose 'frequently used' shuffle pattern operations like the ones mentioned above.

I don't think it's 'too much' for a wasm translator to handle. Picking the right instruction(s) based on the combination of shuffle indices shouldn't be a time-consuming operation. It might be a fairly large chunk of code, but not time-consuming to execute. Jacob also points out that the pshufb instruction is available for a quick initial implementation (I think NEON has a similar one (VTBL?)). When more shuffle patterns emerge as being important for performance, or newer architectures favor other instruction combinations for fixed shuffle patterns, it becomes a matter of adjusting the translator, not a matter of introducing new wasm SIMD opcodes. Isn't one of the goals of WASM to be as 'target-independent' as possible? I'm not sure why there's such a huge amount of code in LLVM that deals with picking the right instruction sequences.
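The list-of-indices abstraction is easy to pin down with a scalar reference model. The sketch below (names like `shuffle8x16` are illustrative, not the proposal's actual opcode names) shows the semantics for the 8x16 case: indices 0-15 select a lane of the first operand, 16-31 a lane of the second.

```cpp
#include <array>
#include <cstdint>

// Reference semantics of the list-of-indices abstraction for 16 byte lanes.
// An index i < 16 selects lane i of `a`; an index i >= 16 selects lane
// i - 16 of `b`. Hypothetical names, for illustration only.
using V128 = std::array<uint8_t, 16>;

inline V128 shuffle8x16(const V128& a, const V128& b,
                        const std::array<uint8_t, 16>& idx) {
  V128 out{};
  for (int i = 0; i < 16; ++i)
    out[i] = idx[i] < 16 ? a[idx[i]] : b[idx[i] - 16];
  return out;
}
```

Every fixed shuffle pattern (zip, unzip, reverse, broadcast, ...) is just a particular choice of `idx`, which is why one operation can cover the whole family.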
I did notice that the clang version of xmmintrin.h uses the __builtin_shufflevector intrinsic to implement other intrinsics, so the compiler has to de-abstract from shufflevector() back to the instruction matching the original user-specified intrinsic. Maybe that's what Jacob mentioned that got the SIMD intrinsics developers miffed :) Anyways, just my two cents. |
Hi Peter, thanks for the comments. On ARM, vtbl is indeed available for arbitrary shuffles. However, I think it's relatively expensive: it only operates on half of a SIMD register (at least on 32-bit ARMv7 CPUs). Also, the shuffle mask is not an immediate operand and must be materialized as a constant (either through 4 mov immediates into general-purpose registers and a vmov into the d-register, or 64 bits loaded from constant-pool memory into a d-register). LLVM seems to try very hard to avoid vtbl. So I see two main disadvantages to the general API:
I realize that this penalizes ISAs that have support for fully general shuffles. I'd reconsider if others can confirm ways around the performance cliff, or that vtbl isn't that bad on ARM. There are many attractive platform-specific capabilities that I would also argue shouldn't be part of a portable SIMD feature.
|
The other consideration here is the producer side. In SIMD.js, many of our users were porting SIMD code from other platforms and/or using APIs that already offer fully general shuffles (clang's and GCC's C/C++ vector extensions, OpenCL, and SPIR-V are all developer-facing portable APIs that offer fully general shuffles). If we only give them a restricted set of shuffles, they'll end up synthesizing the shuffles they need out of multiple shuffle operations. Then, to avoid generating even worse code, engines would be obliged to recombine the shuffles, and would end up facing the same fully-general shuffle problem anyway. I'm sympathetic to the performance cliff argument. Fully general shuffles create some cliffs, and I'd be very interested in ideas for solving this. Unfortunately, restricting the set of shuffles wasm offers doesn't appear to solve it, and it makes other things more complex. I agree about strided accesses and fma at this time; fortunately the proposal here already omits those. Reciprocal and reciprocal sqrt are more nuanced; #3 is already open to discuss them. |
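The recombination burden mentioned above is concrete even in the simplest case: two single-input swizzles emitted by a producer collapse into one general swizzle by composing their index masks, so an engine that wants good code must reconstruct the general shuffle anyway. A minimal sketch (the helper name is hypothetical):

```cpp
#include <array>
#include <cstdint>

// If a producer emits swizzle(swizzle(x, m1), m2), a consumer can fold the
// pair into a single swizzle with the composed mask m1[m2]: lane i of the
// final result reads lane m1[m2[i]] of x. Illustrative sketch only.
using Mask = std::array<uint8_t, 16>;

inline Mask composeSwizzleMasks(const Mask& m1, const Mask& m2) {
  Mask out{};
  for (int i = 0; i < 16; ++i)
    out[i] = m1[m2[i]];
  return out;
}
```

Composing two byte-reversal masks, for example, yields the identity mask, showing that the folded result can be cheaper than either input shuffle.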
I can see a few ways to satisfy these different approaches.
In either case, general swizzle and shuffle could be added to avoid compiler code generation problems, though they could introduce a performance cliff. |
I looked up some performance data just to get an idea what we're talking about. Data sources:
In the tables below I compare the fully general shuffle instructions to a few of the fast special-purpose shuffles provided by each ISA. The numbers are "Latency" x "Throughput".

ARMv7

NEON provides general shuffles with the vtbl instructions.
On these micro-architectures, a general 128-bit swizzle executes with the same latency and throughput as a single fast shuffle instruction once you have the control word loaded into a D-register. A general shuffle has twice the latency of a fast instruction and the same throughput.

ARMv8 AArch64

In 64-bit mode, the tbl instruction operates on full 128-bit registers.
In this mode, a general swizzle has 2x the latency of a fast instruction, and a general shuffle has 3x the latency. Throughput numbers are incomplete; my guess is they are the same.

Intel

With SSSE3 available, general shuffles and swizzles are provided by the pshufb instruction.
Skylake seems to do all its shuffles on port 5, so shuffle throughput is never more than 1.

Conclusion

All of the important SIMD ISAs provide instructions that can be used to synthesize fully general swizzles and shuffles. (MIPS MSA has a set of fully general shuffle instructions as well.) The common cost that everybody pays is that these instructions take their shuffle mask in a vector register rather than as an immediate operand. This cost is harder to quantify; it depends on the context. If enough vector registers are available, these constants can be materialized outside the inner loop, which makes them effectively free. If not, a constant-pool load is not on the critical path and usually not a big deal for an out-of-order design. It all depends, of course.

There is no doubt that the general shuffle instructions are less performant than the various special-purpose instructions provided by the SIMD ISAs. However, when used as catch-all fallbacks they establish a pretty good performance baseline that implementations can improve on. I think the talk of performance cliffs is overstated. It's more like a performance crack in the pavement.

These performance numbers make me a bit skeptical about LLVM's perfect shuffle tables. This doesn't help either:

```cpp
  unsigned PFEntry = PerfectShuffleTable[PFTableIndex];
  unsigned Cost = (PFEntry >> 30);

  if (Cost <= 4)
    return GeneratePerfectShuffle(PFEntry, V1, V2, DAG, dl);
  }

  return GenerateTBL(Op, ShuffleMask, DAG);
}
```

What kind of science was used to determine that a cost of 4 is the right cutoff?

LLVM's x86 backend seems to have received more shuffle-related work recently (and more science). The code is complicated, but the general approach is to prefer the fixed-pattern shuffle instructions when the lane indices allow it.

With the general shuffles in WebAssembly, my advice to implementers would be an incremental approach:

1. Implement the catch-all fallback with pshufb / vtbl so that every shuffle works.
2. Recognize swizzles and other cheap special cases of the lane indices.
3. Add fast instruction sequences for the shuffle patterns that matter in practice.
We can also provide a non-normative list of shuffle patterns that should be fast, starting from the operations @billbudge proposes here. Implementers can use that as a guide to step 3. |
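As a sketch of what the consumer side of such a guide could look like, a translator might classify the incoming lane indices before taking the catch-all path. This is an illustrative fragment with made-up names, not code from any actual engine:

```cpp
#include <array>
#include <cstdint>
#include <string>

// Hypothetical classifier a translator could run before falling back to the
// general pshufb/vtbl path: recognize a few index patterns that map to fast
// single instructions on common ISAs.
using Mask = std::array<uint8_t, 16>;

inline std::string classifySwizzle(const Mask& m) {
  bool identity = true, reverse = true, splat = true;
  for (int i = 0; i < 16; ++i) {
    identity &= m[i] == i;        // no shuffle needed at all
    reverse  &= m[i] == 15 - i;   // e.g. NEON vrev-style reversal
    splat    &= m[i] == m[0];     // broadcast of a single lane
  }
  if (identity) return "identity";
  if (splat)    return "splat";
  if (reverse)  return "reverse";
  return "general";               // catch-all fallback
}
```

A real implementation would grow this list pattern by pattern, which is exactly the incremental strategy described above.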
@stoklund Very nice writeup, Jakob! Thank you! PS: I also thought the perf cliff thing was more like a crack :) Thank you for supplying data to support that. |
@billbudge I would recommend your 1) suggestion. The code size savings you get from 2) are probably not worth the hassle of dealing with extra opcodes. (Aren't you running out of opcodes? I thought I read something about that.) In any case, for modules that make use of shuffles, the static number of shuffle instructions is most likely insignificant compared to the total size of the module. |
@billbudge, I haven't paid much attention to code size in the proposal, and the 16-lane swizzle and shuffle instructions currently encode their 16 lane indices in 16 immediate bytes. I expect SIMD instructions to be concentrated in a few functions that are hot in terms of execution time but make up only a small fraction of the total program's code size. Given the availability and performance of the general shuffle instructions, I am actually tempted to make WebAssembly's shuffle and swizzle instructions more general by permitting out-of-range immediate lane indices. Such lanes would be zeroed in the output. This behavior is supported by all platforms. I'll file a separate issue for that. |
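The zeroing behavior described above can be stated precisely with a small scalar model (assumed names; the actual encoding would be up to the proposal): a lane index past both inputs produces 0, which matches what pshufb (mask byte with the high bit set) and NEON vtbl (out-of-range index) already do in hardware.

```cpp
#include <array>
#include <cstdint>

// Sketch of the proposed zero-for-out-of-range semantics. Indices 0-15
// select from `a`, 16-31 from `b`, and anything larger yields a zero lane.
using V128 = std::array<uint8_t, 16>;

inline V128 shuffleZeroing(const V128& a, const V128& b,
                           const std::array<uint8_t, 16>& idx) {
  V128 out{};
  for (int i = 0; i < 16; ++i) {
    if (idx[i] < 16)
      out[i] = a[idx[i]];
    else if (idx[i] < 32)
      out[i] = b[idx[i] - 16];
    else
      out[i] = 0;  // out-of-range lane index -> zero
  }
  return out;
}
```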
Thanks for the detailed survey @stoklund - I'm currently concerned about ARMv7 code generation for this feature. You left out the cost of materializing the shuffle mask in a SIMD register since it is hard to quantify. In theory, it can be hoisted out of loops. We don't currently do such optimizations in our translator, but after asking around, it may be possible to get our register allocator to give a similar effect. Of course, these shuffle masks may spill and have to be reloaded if there is enough register pressure, so I think it's still preferable to use the primitive permute opcodes when we can (mostly on ARM). Chrome's media team has told us that SIMD loops that spill definitely hurt performance. WebAssembly abstracts away the physical registers, so I think this is a significant problem for us.

I agree that shuffle instruction size is not a real problem. I also agree that we should support general swizzles and shuffles so compilers can generate something. I don't think we'll really know how much this matters until we have some real client code.

More generally, in my own web searches I have found almost no code written to use portable SIMD intrinsics. Has there been a successful portable SIMD design? Everyone we've talked to uses platform-specific SIMD intrinsics. I don't think it's a given that we'll pass Apple's bar of a large enough performance improvement for a real-world component (like a video codec) across multiple target platforms to justify this large addition to WebAssembly. When I see us generating more than one or two instructions to implement a SIMD opcode, I get nervous. I've also been meaning to file an issue about some other opcodes (divide, square root) that I think will be slow on ARM. I'll file that soon. |
@billbudge as far as a 'successful portable SIMD design' goes: I think the C++ standards committee (WG21) is considering Matthias Kretz's proposal, in case you're interested: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0214r0.pdf It probably doesn't qualify as a successful portable design yet, since it's not widely used, but at least it seems like a valiant attempt to rid the world of ugly target-specific use of SIMD intrinsics :) Not sure where it is in the C++ standardization process. He talks about the need for shuffle operations in the intro, but AFAICT there's no actual proposal for an API. |
@PeterJensen Matthias' proposal is fairly advanced, and I expect to see it in C++20 or at least in a TS by then. That being said, the code as written is "portable" but has the ABI just under the surface. It isn't portable in the WebAssembly sense (it doesn't map to opcodes proposed here). That proposal also punts on shuffles. BTW this is the latest revision: http://wg21.link/p0214 |
@billbudge, yes, definitely use the special-purpose NEON shuffle instructions when possible. They are likely to be faster. This goes for Intel too.

I agree that WebAssembly's abstracting away the physical registers is a problem. So is the diversity of pipeline designs for SIMD execution. The two Cortex-A micro-architectures I looked at above have a 3-cycle minimum latency for even the most trivial integer SIMD operations, and 5 cycles for floating point. But at least in 64-bit mode you get 32 registers, which helps with the pipelining. On the other hand, Intel's designs tend to have short latencies, but in 32-bit mode you only have 8 XMM registers. It is difficult to write SIMD code that performs well on those two design points simultaneously.

I think our best bet is to depend on out-of-order execution to overlap the execution of different iterations of the same loop. That way you can saturate the pipelines without spending a ton of registers on something like software pipelining, which is likely to cause spilling. This means that loop bodies should be short so that many iterations fit in the instruction window, and loop-carried dependencies should be avoided or kept simple, like in a reduction. This is easier said than done, of course.

OpenCL was a fairly successful attempt at a language with portable SIMD operations, but I don't know if it saw a lot of use for generating SIMD code running on a CPU. Most uses were for GPU code. I am looking forward to seeing the results of your efforts. I don't think the outcome is a given either. |
What about a NxN transpose that can't be expressed as a single 2-operand operator? It seems like it would be much easier to use a special NxN transpose operator, as well as to generate good code for one. |
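For what it's worth, a 4x4 transpose decomposes into two rounds of two-operand interleave shuffles (the zip/unpck pattern), so it is expressible without a dedicated operator, at the cost of eight shuffles. A scalar sketch with illustrative names:

```cpp
#include <array>
#include <cstdint>

// A 4x4 transpose built from two-operand interleave shuffles only.
// Round 1 interleaves 32-bit lanes; round 2 interleaves 64-bit halves.
using V4 = std::array<uint32_t, 4>;

inline V4 zipLo(const V4& a, const V4& b) { return {a[0], b[0], a[1], b[1]}; }
inline V4 zipHi(const V4& a, const V4& b) { return {a[2], b[2], a[3], b[3]}; }

inline void transpose4x4(V4& r0, V4& r1, V4& r2, V4& r3) {
  V4 t0 = zipLo(r0, r1), t1 = zipLo(r2, r3);
  V4 t2 = zipHi(r0, r1), t3 = zipHi(r2, r3);
  r0 = {t0[0], t0[1], t1[0], t1[1]};
  r1 = {t0[2], t0[3], t1[2], t1[3]};
  r2 = {t2[0], t2[1], t3[0], t3[1]};
  r3 = {t2[2], t2[3], t3[2], t3[3]};
}
```

A dedicated transpose operator would still be easier to pattern-match and lower, which is the point of the question; this only shows the fallback is possible with general shuffles.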
Swizzle(mask, x) is equivalent to Shuffle(mask, x, x). Do we need Swizzle if we have Shuffle? |
Technically, no. LLVM only has a single shufflevector instruction, and a swizzle is simply a shuffle whose lane indices all fall in the first operand. As we have discussed here, WebAssembly consumers are expected to take a closer look at the lane indices anyway, so asking them to recognize swizzles as shuffles where all the lane indices are in the first vector is not a big deal.
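Recognizing a swizzle in a consumer is then a one-line check on the immediate lane indices (hypothetical helper name):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// A two-input shuffle of 16-byte vectors is really a swizzle of the first
// operand when every lane index stays below 16, i.e. never selects from b.
inline bool isSwizzle(const std::array<uint8_t, 16>& idx) {
  return std::all_of(idx.begin(), idx.end(),
                     [](uint8_t i) { return i < 16; });
}
```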
It might be an old issue, but I think it is still relevant, so here is what I think: it might be interesting to have a variant with only one vector to shuffle, just to simplify the generic case.
For those who are interested, I created a merge request about this: #30 |
Closing this issue as the current proposal has generic shuffles and swizzles included. |
These operations will be very difficult to implement efficiently.
Even on Intel (SSE, AVX) which has native shuffle instructions, the code generation for this is very complex. Here is the Intel instruction selection lowering in LLVM. There is quite a lot of code to implement the shuffle intrinsic:
X86ISelLowering.cpp
The ARM version is worse, since many cases require use of 2 'vtbl' instructions, in addition to materializing the two constant shuffle masks in d-registers. LLVM uses a pre-generated table of 26K entries to generate fast instruction sequences for the 32x4 shuffles.
ARMISelLowering.cpp
This is a lot of complexity for any compiler, and too much for WebAssembly translators. Most of these swizzles and shuffles will never be used. It is a hazard to provide this feature if we can't guarantee that all shuffles will be fast on all platforms.
An alternative is to implement a small set of primitive permutations that we know can be implemented efficiently, without lots of work in the translator. I think these should cover most real-world cases, and can be composed by the programmer or compiler for other shuffles. By being similar to a real ISA, it should also be straightforward to modify toolchains to support WASM SIMD.
I'm recommending a set along the lines of the ARM permutation instructions:
Additionally, we may want to have shuffles that reverse the lanes in various patterns like the ARM vrev instructions.
On Intel these can be implemented using the pshuf* instructions (pshufd and friends). We're assuming SSE 4.1 as the baseline for SIMD support right now. POWER and MIPS have similar primitive shuffles.
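To make the proposed primitives concrete, here are scalar models of a few NEON-style fixed permutations (vuzp/vtrn/vrev analogues) on 32x4 lanes. The names are illustrative, not a concrete opcode proposal; each pattern lowers to one or two instructions on the ISAs discussed above.

```cpp
#include <array>
#include <cstdint>

// Scalar models of fixed permutation primitives in the spirit of the ARM
// permutation instructions. Each has a hard-coded lane pattern, so a
// translator needs no index analysis to lower it.
using V4 = std::array<uint32_t, 4>;

// De-interleave (vuzp-style): even lanes of a then b, or odd lanes.
inline V4 uzpEven(const V4& a, const V4& b) { return {a[0], a[2], b[0], b[2]}; }
inline V4 uzpOdd (const V4& a, const V4& b) { return {a[1], a[3], b[1], b[3]}; }

// Transpose step (vtrn-style): pair up matching even lanes of a and b.
inline V4 trnLo  (const V4& a, const V4& b) { return {a[0], b[0], a[2], b[2]}; }

// Reverse 32-bit lanes within each 64-bit half (vrev64.32-style).
inline V4 rev64  (const V4& a)              { return {a[1], a[0], a[3], a[2]}; }
```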